Preface



Machine Learning has become a key enabling technology for many engineering 
applications, investigating scientific questions and theoretical problems alike. To 
stimulate discussions and to disseminate new results, a series of summer schools 
was started in February 2002. One year later, two more such summer schools
were held, one at the Australian National University in Canberra, Australia,
and the other at the Max Planck Institute for Biological Cybernetics in
Tübingen, Germany.

The current book contains a collection of the main talks held during those two
summer schools, presented as tutorial chapters on topics such as Pattern
Recognition, Bayesian Inference, Unsupervised Learning and Statistical Learning
Theory. The papers provide an in-depth overview of these exciting new areas,
contain a large set of references, and thereby provide the interested reader with
further information to start or to pursue their own research in these directions.

Complementary to the book, photos and slides of the presentations can be 
obtained at 

http://mlg.anu.edu.au/summer2003

and 

http://www.irccyn.ec-nantes.fr/mlschool/mlss03/home03.php.

The general entry point for past and future Machine Learning Summer Schools 
is 

http://www.mlss.cc

It is our hope that graduate students, lecturers, and researchers alike will find 
this book useful in learning and teaching Machine Learning, thereby continuing 
the mission of the Machine Learning Summer Schools. 



Tübingen, June 2004                                        Olivier Bousquet
                                                           Ulrike von Luxburg
                                                           Gunnar Rätsch
An Introduction to Pattern Classification



Elad Yom-Tov 

IBM Haifa Research Labs, University Campus, Haifa 31905, Israel 
yomtov@il.ibm.com



1 Introduction 

Pattern classification is the field devoted to the study of methods designed to 
categorize data into distinct classes. This categorization can be either distinct 
labeling of the data (supervised learning), division of the data into classes (unsu- 
pervised learning), selection of the most significant features of the data (feature 
selection), or a combination of more than one of these tasks. 

Pattern classification belongs to a class of problems that humans (under most
circumstances) solve extremely well, but that are difficult for computers to
perform. The subject has been studied extensively for many years. During the
past decade, however, the introduction of several new classes of pattern
classification algorithms has brought performance much better than previously
attained.

The goal of the following article is to give the reader a broad overview of 
the field. As such, it attempts to introduce the reader to important aspects of 
pattern classification, without delving deeply into any of the subject matters. 
The exceptions to this rule are those points deemed especially important or 
those that are of special interest. Finally, we note that the focus of this article
is statistical methods for pattern recognition. Thus, methods such as fuzzy
logic and rule-based methods are outside the scope of this article.



2 What Is Pattern Classification? 

Pattern classification, also referred to as pattern recognition, attempts to build 
algorithms capable of automatically constructing methods for distinguishing be- 
tween different exemplars, based on their differentiating patterns. 

Watanabe [53] described a pattern as "the opposite of chaos; it is an entity,
vaguely defined, that could be given a name." Examples of patterns are human
faces, handwritten letters, and the DNA sequences that may cause a certain
disease. More formally, the goal of a (supervised) pattern classification task is to
find a functional mapping from the input data X, used to describe an input
pattern, to a class label Y so that Y = f(X). Construction of the mapping is
based on training data supplied to the pattern classification algorithm. The
mapping f should give the smallest possible error, i.e. the minimum number of
examples where Y will be the wrong label, especially on test data not seen by
the algorithm during the learning phase.

An important division of pattern classification tasks is that between supervised
and unsupervised classification. In supervised tasks the training data
consists of training patterns, as well as their required labeling. An example is
a set of DNA sequences labeled to show which examples are known to harbor a
genetic trait and which ones do not. In unsupervised classification tasks the
labels are not provided, and the task of the algorithm is to find a "good"
partition of the data into clusters. An example of this kind of task is the
grouping of Web pages into sets so that each set is concerned with a single
subject matter.

A pattern is described by its features. These are the characteristics of the 
examples for a given problem. For example, in a face recognition task some 
features could be the color of the eyes or the distance between the eyes. Thus, 
the input to a pattern recognition task can be viewed as a two-dimensional 
matrix, whose axes are the examples and the features. 

Pattern classification tasks are customarily divided into several distinct blocks. 
These are: 

1. Data collection and representation. 

2. Feature selection and/or feature reduction. 

3. Clustering. 

4. Classification. 

Data collection and representation are mostly problem-specific. Therefore it
is difficult to give general statements about this step of the process. In broad
terms, one should try to find invariant features that describe the differences
between classes as well as possible.

Feature selection and feature reduction attempt to reduce the dimensionality
(i.e. the number of features) for the remaining steps of the task. Clustering
methods are used in order to reduce the number of training examples presented
to the task. Finally, the classification phase of the process finds the actual
mapping between patterns and labels (or targets). In many applications not all
steps are needed. Indeed, as computational power grows, the need to reduce the
number of patterns used as input to the classification task decreases, and may
therefore make the clustering stage superfluous for many applications.

In the following pages we describe feature selection and reduction, clustering, 
and classification. 



3 Feature Selection and Feature Reduction: Removing 
Excess Data 

When data is collected for later classification, it may seem reasonable to assume 
that if more features describing the data are collected it will be easier to classify 
these data correctly. In fact, as Trunk [50] demonstrated, more features may be
detrimental to classification, especially if the additional features are highly
correlated with those already present. Furthermore, noisy and irrelevant features
are detrimental to classification as they are known to cause the classifier to
generalize poorly, increase the computational complexity, and require many
training samples to reach a given accuracy [4].

Conversely, selecting too few features leads to the situation described by the
ugly duckling theorem [53]: it becomes impossible to distinguish between the
classes because there is too little information to differentiate them. For example,
suppose we wish to classify a vertebrate animal into one of the vertebrate classes
(Mammals, Birds, Fish, Reptiles, or Amphibians). A feature that tells us whether
the animal has skin is superfluous, since all vertebrates have skin. However, a
feature that indicates whether the animal is warm-blooded is highly significant
for the classification. A feature selection algorithm should be able to identify
and remove the former feature, while preserving the latter.

Hence the goal of this stage in the processing is to choose a subset of features 
or some combination of the input features that will best represent the data. We 
refer to the process of choosing a subset of the features as feature selection, 
and to finding a good combination of the features as feature reduction. 

Feature selection is a difficult combinatorial optimization problem. Finding 
the best subset of features by testing all possible combinations is practically 
impossible even when the number of input features is modest. For example,
attempting to test all possible combinations of 100 input features will require
testing $10^{30}$ combinations. It is not uncommon for text classification problems
to have $10^4$ to $10^7$ features [27]. Consequently, numerous methods have been
proposed for finding a (suboptimal) solution by testing a fraction of the possible 
combinations. 
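
The size of the search space quoted above can be checked directly. Assuming that
every subset of the 100 input features (of any size) counts as one candidate
combination, the number of candidates is:

```latex
\sum_{k=0}^{100} \binom{100}{k} = 2^{100} \approx 1.27 \times 10^{30}
```

Each additional input feature doubles this count, which is why exhaustive search
stops being practical so quickly.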

Feature selection methods can be divided into three main types [4]: 

1. Wrapper methods: The feature selection is performed around (and with) a 
given classification algorithm. The classification algorithm is used for ranking 
possible feature combinations. 

2. Embedded methods: The feature selection is embedded within the classifi- 
cation algorithm. 

3. Filter methods: Features are selected for classification independently of the 
classification algorithm. 

Most feature selection methods are of the wrapper type. The simplest algorithms
in this category are exhaustive search, which is practical only when the
number of features is small, sequential forward feature selection (SFFS),
and sequential backward feature selection (SBFS). In sequential forward
feature selection, the single feature with which the lowest classification error
is reached is selected first. Then, the feature that, when added, causes the
largest reduction in error is added to the set of selected features. This process
is continued iteratively until the maximum number of features needed is found or
until the classification error starts to increase. Although sequential feature
selection does not account for dependence between features, it usually attains
surprisingly reasonable results. There are several minor modifications to SFFS
and SBFS, such as Sequential Floating Search [41] or the "Plus n, take away m"
method.
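
The sequential forward procedure described above can be sketched in a few lines.
This is a minimal illustration, not a reference implementation: the wrapped
classifier (a nearest-class-mean rule scored on the training data) and the names
`nearest_mean_error` and `forward_selection` are assumptions made for the example;
any classifier and error estimate could be substituted.

```python
def nearest_mean_error(X, y, features):
    """Training error of a nearest-class-mean classifier (an assumed stand-in
    for the wrapped classifier) that only looks at the given feature indices."""
    classes = sorted(set(y))
    means = {}
    for c in classes:
        rows = [x for x, label in zip(X, y) if label == c]
        means[c] = [sum(r[f] for r in rows) / len(rows) for f in features]

    def sq_dist(x, c):
        return sum((x[f] - m) ** 2 for f, m in zip(features, means[c]))

    errors = sum(
        1 for x, label in zip(X, y)
        if min(classes, key=lambda c: sq_dist(x, c)) != label
    )
    return errors / len(X)


def forward_selection(X, y, max_features, error_fn=nearest_mean_error):
    """Greedily add the feature that gives the lowest error at each step."""
    selected, best_err = [], float("inf")
    remaining = list(range(len(X[0])))
    while remaining and len(selected) < max_features:
        # Score every candidate extension of the current feature set.
        err, f = min((error_fn(X, y, selected + [f]), f) for f in remaining)
        if err > best_err:
            break  # the error started to increase, so stop
        selected.append(f)
        remaining.remove(f)
        best_err = err
    return selected, best_err
```

Because only one feature is added per step, the sketch shares the limitation
noted in the text: it cannot discover features that are useful only jointly.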

One of the major drawbacks of methods that select and add a single feature at
each step is that they might not find combinations of features that perform well
together, but are poor predictors individually. More sophisticated methods for
feature selection use simulated annealing or genetic algorithms [56] for solving
the optimization problem of feature selection. The latter approach has shown
promise in solving problems where the number of input features is extremely
large.

An interesting approach to feature selection is based on information-theoretic
considerations [25]. This algorithm estimates the cross-entropy between every
pair of features, and discards those features that have a large cross-entropy with
other features, thus removing features that add little additional classification
information. This is because the cross-entropy estimates the amount of knowledge
that one feature provides about other features. The algorithm is appealing in that
it is independent of the classification algorithm, i.e. it is a filter algorithm.
However, the need to estimate the cross-entropy between features limits its use to
applications where the datasets are large or to cases where features are discrete.

As mentioned above, a second approach to reducing the dimension of the
features is to find a lower-dimensional combination (linear or non-linear) of the
features that represents the data as well as possible in the required dimension.

The most commonly used technique for feature reduction is principal component
analysis (PCA), also known as the Karhunen-Loève Transform (KLT).
PCA reshapes the data along the directions of maximal variance. PCA works
by computing the eigenvectors corresponding to the largest eigenvalues of the
covariance matrix of the data, and returning the projection of the data on these
eigenvectors. An example of feature reduction using PCA is given in Figure 1.
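
The computation described above (covariance matrix, leading eigenvectors,
projection) can be sketched with NumPy as follows. The function name `pca` and
its return values are choices made for this illustration only.

```python
import numpy as np


def pca(X, n_components):
    """Project the rows of X (examples x features) onto the directions of
    largest variance. Returns the projected data, the directions, and the
    variance along each direction."""
    X_centered = X - X.mean(axis=0)
    cov = np.cov(X_centered, rowvar=False)          # feature covariance matrix
    eigvals, eigvecs = np.linalg.eigh(cov)          # eigenvalues in ascending order
    order = np.argsort(eigvals)[::-1][:n_components]
    components = eigvecs[:, order]                  # columns are the top eigenvectors
    return X_centered @ components, components, eigvals[order]
```

`np.linalg.eigh` is used because the covariance matrix is symmetric; it returns
eigenvalues in ascending order, hence the explicit reordering.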




Fig. 1. Feature reduction using principal component analysis. The figure on the left
shows the original data. Note that most of the variance in the data is along a single
direction. The figure on the right shows the probability density function of the same
data after feature reduction to a dimension of 1 using PCA.

Principal component analysis does not take into account the labels of the
data. As such, it is an unsupervised method. A somewhat similar, albeit
supervised, linear method is the Fisher Discriminant Analysis (FDA). This
method projects the data on a single dimension, while maximizing the separation
between the classes of the data.
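
For two classes, the projection direction used by FDA has a well-known closed
form: the inverse of the within-class scatter matrix applied to the difference
of the class means. The sketch below assumes exactly two classes and an
invertible scatter matrix; the function name `fisher_direction` is hypothetical.

```python
import numpy as np


def fisher_direction(X0, X1):
    """Unit vector w maximizing between-class separation relative to
    within-class scatter, for two classes given as (examples x features)."""
    m0, m1 = X0.mean(axis=0), X1.mean(axis=0)
    # Within-class scatter: sum of the two per-class scatter matrices.
    Sw = (X0 - m0).T @ (X0 - m0) + (X1 - m1).T @ (X1 - m1)
    w = np.linalg.solve(Sw, m1 - m0)    # w is proportional to Sw^{-1} (m1 - m0)
    return w / np.linalg.norm(w)
```

Projecting a pattern x onto the single dimension is then just `x @ w`.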

A more sophisticated projection method is Independent Component
Analysis (ICA) [8]. This method finds a linear mixture of the data, in the
same dimension of the data or lower. ICA attempts to find a mixture matrix
such that each of the projections will be as independent as possible from the
other projections.

Instead of finding a linear mixture of the features, it is also possible to find
a nonlinear mixture of the data. This is usually done through modifications of
the above-mentioned linear methods. Examples of such methods are nonlinear
component analysis [33], nonlinear FDA [32], and Kernel PCA [46]. The latter
method works by remapping data by way of a kernel function into a feature space
where the principal components of the data are found.

As a final note on feature selection and feature reduction, one should note 
that as the ratio between the number of features and the number of training 
examples increases, it becomes likelier for a noisy and irrelevant feature to seem 
relevant for the specific set of examples. Indeed, feature selection is sometimes 
viewed as an ill-posed problem [52], which is why application of such methods 
should be performed with care. For example, if possible, the feature selection 
algorithm should be run several times, and the results tested for consistency. 



4 Clustering 

The second stage of the classification process endeavors to reduce the number 
of data points by clustering the data and finding representative data points (for 
example, cluster centers), or by removing superfluous data points. This stage is 
usually performed using unsupervised methods. 

A cluster of points is not a well-defined object. Instead, clusters are defined 
based on their environment and the scale at which the data is examined. Figure 2 
demonstrates the nature of the problem. Two possible definitions for clusters [23]
are: (I) Patterns within a cluster are more similar to each other than to patterns
belonging to other clusters. (II) A cluster is a volume of high-density points
separated from other clusters by relatively low-density volumes. Neither of these
definitions suggests a practical solution to the problem of finding clusters.
In practice one usually specifies a criterion for joining points into clusters or the
number of clusters to be found, and these are used by the clustering algorithm
in place of a definition of a cluster. This practicality results in a major drawback
of clustering algorithms: a clustering algorithm will find clusters even if there
are no clusters in the data.

Returning to the vertebrate classification problem discussed earlier, if we
are given data on all vertebrate species, we may find that this comprises
too many training examples. It may be enough to find a representative sample
for each of the classes and use it to build the classifier. Clustering algorithms
attempt to find such representatives. Note that representative samples can be
either actual samples drawn from the data (for example, a human as an example
for a mammal) or an average of several samples (e.g. an animal with some given
percentage of hair on its body as a representative mammal).

The computational cost of finding an optimal partition of a dataset into a
given number of clusters is usually prohibitively high. Therefore, in most cases
clustering algorithms attempt to find a suboptimal partition in a reasonable
number of computations. Clustering algorithms can be divided into Top-Down
(or partitional) algorithms and Bottom-Up (or hierarchical) algorithms.

Fig. 2. An example of data points for clustering. Many possible clustering
configurations can be made for this data, based on the scale at which the data is
examined, the shape of the clusters, etc.

A simple example of a Bottom-Up algorithm is the Agglomerative Hierarchical
Clustering Algorithm (AGHC). This is an iterative algorithm, which starts by
assuming that each data point is a cluster. At each iteration two clusters are
merged, until a preset number of clusters is reached. The decision on which
clusters are to be merged can be made using one of several functions, e.g. the
distance between cluster centers, the distance between the two nearest
points in different clusters, etc. AGHC is a very simple, intuitive scheme.
However, it is computationally intensive and thus impractical for medium and
large datasets.
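
A minimal sketch of the agglomerative scheme is given below. It assumes the
second merging criterion mentioned above (distance between the two nearest
points in different clusters, often called single linkage); the function name
`agglomerative` is chosen for this illustration.

```python
def agglomerative(points, n_clusters):
    """Merge the two closest clusters until n_clusters remain."""
    clusters = [[p] for p in points]          # every point starts as its own cluster

    def sq_dist(a, b):
        return sum((ai - bi) ** 2 for ai, bi in zip(a, b))

    def linkage(c1, c2):
        # Distance between the two nearest points of the two clusters.
        return min(sq_dist(a, b) for a in c1 for b in c2)

    while len(clusters) > n_clusters:
        pairs = [(i, j) for i in range(len(clusters))
                 for j in range(i + 1, len(clusters))]
        i, j = min(pairs, key=lambda ij: linkage(clusters[ij[0]], clusters[ij[1]]))
        clusters[i] = clusters[i] + clusters[j]
        del clusters[j]                       # j > i, so index i is unaffected
    return clusters
```

Every iteration re-examines all pairs of clusters and all pairs of points within
them, which illustrates why the method becomes expensive on larger datasets.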

Top-Down methods are the type more frequently used for clustering due
to their lower computational cost, despite the fact that they usually find an
inferior solution compared to Bottom-Up algorithms. Probably the most popular
among Top-Down clustering algorithms is the K-means algorithm [28],
pseudo-code for which is given in Figure 3. K-means is usually reasonably fast, but
care should be taken in the initial setting of the cluster centers so as to attain a
good partition of the data. There are probably hundreds of Top-Down clustering
algorithms; popular ones include fuzzy k-means [3], Kohonen maps
[24], and competitive learning [44].

Recently, with the advent of kernel-based methods several algorithms for clus- 
tering using kernels have been suggested (e.g. [2]). The basic idea behind these 
algorithms is to map the data into a higher dimension using a non-linear function 
of the input features, and to cluster the data using simple clustering algorithms 
at the higher dimension. More details regarding kernels are given in the Classi- 
fication section of this paper. One of the main advantages of kernel methods is 
that simple clusters (for example, ellipsoid clusters) formed in a higher dimen- 
sion correspond to complex clusters in the input space. These methods seem to 
provide excellent clustering results, with reasonable computational costs. 

A related class of clustering algorithms is the Spectral Clustering methods
[37, 11]. These methods first map the data into a matrix representing the
distances between the input patterns. The matrix is then projected onto the
eigenvectors corresponding to its k largest eigenvalues, and the clustering is
performed on this projection. These methods demonstrated impressive results on
several datasets, with computational costs slightly higher than those of
kernel-based algorithms.



The K-means clustering algorithm

1. Initialize N random cluster centers.

2. Assign each of the data points to the nearest of the N cluster centers.

3. Recompute the cluster centers by averaging the points assigned to each cluster.

4. Repeat steps 2-3 until there is no change in the location of the cluster centers.

5. Return the cluster centers.



Fig. 3. Pseudo-code of the K-means clustering algorithm 
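
The pseudo-code of Figure 3 translates almost line by line into Python. In this
sketch the random initial centers are drawn from the data points themselves, and
an optional `init` argument allows the centers to be set by hand; both are
assumptions of the example rather than part of the figure.

```python
import random


def kmeans(points, n_clusters, init=None, seed=0, max_iter=100):
    """Return cluster centers for a list of points given as tuples."""
    def sq_dist(a, b):
        return sum((ai - bi) ** 2 for ai, bi in zip(a, b))

    # Step 1: initial centers (random data points unless supplied by the caller).
    if init is not None:
        centers = list(init)
    else:
        centers = random.Random(seed).sample(points, n_clusters)

    for _ in range(max_iter):
        # Step 2: assign every point to its nearest center.
        groups = [[] for _ in range(n_clusters)]
        for p in points:
            nearest = min(range(n_clusters), key=lambda k: sq_dist(p, centers[k]))
            groups[nearest].append(p)
        # Step 3: recompute each center as the mean of its points
        # (an empty cluster keeps its previous center).
        new_centers = [
            tuple(sum(coords) / len(g) for coords in zip(*g)) if g else centers[k]
            for k, g in enumerate(groups)
        ]
        # Step 4: stop once the centers no longer move.
        if new_centers == centers:
            break
        centers = new_centers
    return centers                      # Step 5
```

Different initial centers can lead to different final partitions, which is the
sensitivity to initialization mentioned in the text.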



5 Classification 

Classification, the final stage of a pattern classifier, is the process of assigning la- 
bels to test patterns, based on previously labeled training patterns. This process 
is commonly divided into a learning phase, where the classification algorithm is 
trained, and a classification phase, where the algorithm labels new data. 

The general model for statistical pattern classification is one where patterns
are drawn from an unknown distribution P, which depends on the label of the
data (i.e., $P(x \mid \omega_i)$, $i = 1, \ldots, N$, where N is the number of labels in the data).
During the learning phase the classification algorithm is trained with the goal 
of minimizing the error that will be obtained when classifying some test data. 
This error is known as the risk or the expected loss. 

When discussing the pros and cons of classification algorithms, it is important 
to set criteria for assessing these algorithms. In the following pages we describe 
several classification algorithms and later summarize (in Table 1) their strong
and weak points with regard to the following points: 

— How small are the classification errors reached by the algorithm? 

— What is the computational cost and the memory requirements for both train- 
ing and testing? 

— How difficult is it for a novice user to build and train an efficient classifier? 

— Is the algorithm able to learn on-line (i.e. as the data appears, allowing each 
data point to be addressed only once)? 

— Can one gain insight about the problem from examining the trained classi- 
fier? 

It is important to note that when discussing the classification errors of clas- 
sifiers one is usually interested in the errors obtained when classifying test data. 
Many classifiers can be trained to classify all the training data correctly. This
does not imply that they will perform well on unseen test data. In fact, it is more
often the case that if a classification algorithm reduces the training set error to
zero, it has been over-fitted to the training data. Section 6 discusses methods
that can be applied in order to avoid over-fitting. A good classifier should be able 
to generalize from the training data, i.e. learn a set of rules from the training 
data that will then be used to classify the test data. 

The first question one should address in the context of classification is whether
there is an optimal classification rule (with regard to the classification error).
Surprisingly, such a rule exists, but in practice one can rarely use it. The optimal
classification rule is the Bayes rule. Suppose that we wish to minimize the
expected loss function $R(\omega_i \mid x) = \sum_{j=1}^{N} L(\omega_i, \omega_j) P(\omega_j \mid x)$,
where L is the loss function for deciding on class i given that the correct class
is class j. If the zero/one loss is used (i.e. a wrong decision entails a loss of
one, and a correct decision results in a loss of zero) the Bayes rule simplifies to
the Maximum A Posteriori (MAP) rule, which requires that we label an input
sample x with the label i if $P(\omega_i \mid x) > P(\omega_j \mid x)$ for all $j \neq i$.
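
As a small numerical illustration of the rule above, the sketch below computes
the expected loss of each possible decision from a vector of posterior
probabilities and a loss matrix, and picks the decision with the smallest risk.
The posteriors are assumed to be known here, which is exactly what is rarely
true in practice; the function name `bayes_decision` is hypothetical.

```python
def bayes_decision(posteriors, loss):
    """posteriors[j] is P(class j | x); loss[i][j] is the loss for deciding
    class i when the true class is j. Returns (best decision, all risks)."""
    risks = [
        sum(l_ij * p_j for l_ij, p_j in zip(row, posteriors))
        for row in loss
    ]
    best = min(range(len(risks)), key=risks.__getitem__)
    return best, risks


# With the zero/one loss the decision coincides with the MAP rule.
zero_one = [[0, 1], [1, 0]]
decision, risks = bayes_decision([0.7, 0.3], zero_one)
```

With an asymmetric loss matrix the chosen label can differ from the most
probable one, since a costly mistake is weighted more heavily.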

As mentioned above, it is usually impossible to use the Bayes rule because it 
requires full knowledge of the class-conditional densities of the data. Thus, one 
is frequently left with one of two options. If a model for the class-conditional 
densities is known (for example, if it is known that the data consists of two 
Gaussians for one class and a single uniform distribution for the other class), one
can use plug-in rules to build a classifier. Here, given the model, its parameters 
are estimated, and then the MAP rule can be used. If a model for the data 
cannot be provided, classifiers can proceed by estimating the density of the data 
or the decision boundaries between the different classes. 

The simplest plug-in model for data is to assume that each class of data is
drawn from a single Gaussian. Under this assumption, the mean and variance of
each class are estimated, and the labeling of test points is achieved through the
MAP rule. If the data is known to be drawn from a mixture of several
distributions (e.g. Gaussian or uniform), the parameters of these distributions can
be computed through algorithms such as the Expectation-Maximization (EM)
algorithm [12]. In order to operate the EM algorithm, the number of components in
each class must be known in advance. This is not always simple, and an incorrect
number might result in an erroneous solution. It is possible to alleviate this effect
by estimating the number of components in the data using ML-II [29] or MDL [1].
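
The single-Gaussian plug-in rule can be sketched as follows for one-dimensional
data. The sketch assumes scalar inputs, a non-zero variance in every class, and
class priors estimated from the class frequencies; the function names are
hypothetical. It does not implement EM.

```python
import math


def fit_gaussians(xs, ys):
    """Estimate (mean, variance, prior) for every class label in ys."""
    params = {}
    for c in set(ys):
        vals = [x for x, y in zip(xs, ys) if y == c]
        mean = sum(vals) / len(vals)
        var = sum((v - mean) ** 2 for v in vals) / len(vals)
        params[c] = (mean, var, len(vals) / len(xs))
    return params


def map_label(x, params):
    """Label x with the class of largest posterior (compared in log form)."""
    def log_posterior(c):
        mean, var, prior = params[c]
        return (math.log(prior)
                - 0.5 * math.log(2 * math.pi * var)
                - (x - mean) ** 2 / (2 * var))
    return max(params, key=log_posterior)
```

Working with log-probabilities avoids numerical underflow for points far from
every class mean and does not change which class attains the maximum.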

Most classification algorithms do not attempt to find or even to approximate 
the Bayes decision region. Instead, these algorithms classify points by estimating 
decision regions or through estimation of densities. Arguably the simplest of 
these methods is the k-Nearest Neighbor classifier. Here the k points of the 
training data closest to the test point are found, and a label is given to the test 
point by a majority vote among the k points. This method is highly intuitive
and attains remarkably low classification errors, but it is computationally
intensive and requires a large memory to store the training data.
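
A minimal sketch of the k-Nearest Neighbor rule follows. It assumes Euclidean
distance and breaks voting ties in favor of the label seen first among the
nearest points; the function name `knn_predict` is chosen for this example.

```python
from collections import Counter


def knn_predict(train_X, train_y, x, k=3):
    """Label x by majority vote among its k nearest training points."""
    by_distance = sorted(
        zip(train_X, train_y),
        key=lambda pair: sum((a - b) ** 2 for a, b in zip(pair[0], x)),
    )
    votes = Counter(label for _, label in by_distance[:k])
    return votes.most_common(1)[0][0]
```

Note that every prediction scans the whole training set, which reflects the
computational and memory costs mentioned above.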

Another intuitive class of classification algorithms is decision trees. These
algorithms solve the classification problem by repeatedly partitioning the input
space, so as to build a tree whose nodes are as pure as possible (that is, they
contain points of a single class). An example of a tree for classifying vertebrates
into classes is shown in Figure 4. Classification of a new test point is achieved by
moving from top to bottom along the branches of the tree, starting from the root
node, until a terminal node is reached. Decision trees are simple yet effective
classification schemes for small datasets. Large datasets tend to result in
complicated trees, which in turn require a large memory for storage. There is
considerable literature on methods for simplifying and pruning decision trees (for
example [30]). Another drawback of decision trees is their relative sensitivity to
noise, especially if the size of the training data is small. The most commonly used
algorithms for building decision trees are CART [6], developed by Breiman et al.,
and ID3 [42] and C4.5 [43], both developed by Quinlan.

Fig. 4. An example of a decision tree. Classification of a new test point is achieved by
moving from top to bottom along the branches of the tree, starting from the root node,
until a terminal node (square) is reached
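
The notion of choosing partitions that make the nodes as pure as possible can be
illustrated with a single split on one numeric feature. The sketch below uses
the Gini impurity as the purity measure, which is an assumption of this example
(the text does not name a specific measure), and the function names are
hypothetical.

```python
def gini(labels):
    """Gini impurity of a list of labels: 0.0 for a pure node."""
    n = len(labels)
    return 1.0 - sum((labels.count(c) / n) ** 2 for c in set(labels))


def best_split(values, labels):
    """Return (weighted impurity, threshold) of the best split of the form
    value <= threshold versus value > threshold."""
    best = (float("inf"), None)
    for threshold in sorted(set(values))[:-1]:
        left = [l for v, l in zip(values, labels) if v <= threshold]
        right = [l for v, l in zip(values, labels) if v > threshold]
        score = (len(left) * gini(left) + len(right) * gini(right)) / len(labels)
        best = min(best, (score, threshold))
    return best
```

A full tree-building algorithm would apply such a search recursively to each
resulting node, over all features, until the nodes are pure or a stopping rule
is met.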

An important approach to classification is through estimation of the den- 
sity of data for each of the classes and classifying test points according to the 
maximum posterior probability. A useful algorithm for density estimation is the 
Parzen windows estimation [39]. Parzen windows estimate the probability of a 
point in the input space by weighing training points using a Gaussian window 
function (the farther a training sample is from the test sample, the lower its 
weight). This method is, however, expensive both computationally and
memory-wise. Furthermore, many training points are required for correct estimation of
the class densities. 
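
A sketch of Parzen-window classification with a Gaussian window is shown below.
The window width `h` is a free parameter that must be chosen by the user, and
the class priors are assumed to be proportional to the number of training
samples per class; both function names are hypothetical.

```python
import math


def parzen_density(x, samples, h=1.0):
    """Gaussian-window density estimate at x from a list of sample tuples."""
    d = len(x)
    norm = (2 * math.pi * h * h) ** (d / 2)
    total = sum(
        math.exp(-sum((a - b) ** 2 for a, b in zip(x, s)) / (2 * h * h))
        for s in samples
    )
    return total / (len(samples) * norm)


def parzen_classify(x, samples_by_class, h=1.0):
    """Pick the class with the largest prior-weighted density estimate at x."""
    n_total = sum(len(s) for s in samples_by_class.values())

    def score(c):
        prior = len(samples_by_class[c]) / n_total
        return prior * parzen_density(x, samples_by_class[c], h)

    return max(samples_by_class, key=score)
```

Each density evaluation sums over all stored training samples, which is the
source of the computational and memory costs noted above.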

Another approach for classification is to optimize a functional mapping from
input patterns to output labels so that the training error will be as small as
possible. If, for example, we assume a linear mapping (i.e. that the classifier
takes the form of a weighted sum of the components of the input pattern), it is
possible to find a closed-form solution to the optimization (under a least-squares
criterion) through the Moore-Penrose pseudo-inverse. Suppose the training patterns
are placed in a matrix P of size $N \times D$, where D is the input dimension and N
the number of examples, and that the corresponding labels are placed in an
$N \times 1$ vector T. We wish to find a weight vector w so that:

$$P \cdot w = T$$

The least-squares (LS) solution to this problem is:

$$w = (P^T P)^{-1} P^T \cdot T$$

Assuming that the labels of the data are either -1 or +1, the labeling of a
new test point x will be:

$$\hat{y} = \mathrm{sign}(w^T \cdot x) = \begin{cases} +1 & \text{if } w^T \cdot x > 0 \\ -1 & \text{if } w^T \cdot x < 0 \end{cases}$$

LS is extremely efficient in both memory requirement and computational 
effort, but it is usually too simplistic a model to obtain sufficiently good results 
for the data. 
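
The least-squares classifier can be written in a few lines with NumPy, whose
`np.linalg.pinv` computes the Moore-Penrose pseudo-inverse. The sketch has no
separate bias term; appending a constant column of ones to the pattern matrix,
as in the usage lines, is an assumption of this example.

```python
import numpy as np


def ls_train(P, T):
    """Weight vector minimizing the squared error of P @ w against T."""
    return np.linalg.pinv(P) @ T


def ls_predict(w, x):
    """Label (+1 or -1) of one pattern x, or of every row of a matrix x."""
    return np.sign(x @ w)


# Four one-dimensional patterns with a constant column acting as a bias input.
P = np.array([[1.0, 1.0], [2.0, 1.0], [-1.0, 1.0], [-2.0, 1.0]])
T = np.array([1.0, 1.0, -1.0, -1.0])
w = ls_train(P, T)
```

Training is a single matrix computation and the stored model is just the vector
w, which is why the method is so economical in time and memory.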

The optimization approach to pattern classification has been utilized in nu- 
merous other algorithms. An interesting example is the use of Genetic Pro- 
gramming (GP) for classification. Genetic algorithms are computational models 
inspired by evolution [55]. As such, they encode potential solutions to an opti- 
mization problem as a chromosome-like data structure and apply recombination 
operators on these structures. These recombination operators are designed so as 
to gradually improve the solutions, much like evolution improves individuals in 
a population. In genetic programming the encoded solution is a function, and 
the goal is to search in function space for a mapping of inputs to labels that 
will reduce the training error. GP can sometimes find a very good solution with
both a low error and small computational and memory requirements, but there
is no proof that it will converge (at all, or to a good solution), and thus it is not
a popular algorithm.

Perhaps one of the most commonly used approaches to classification that solves an
optimization problem is Neural Networks (NN). Neural networks (suggested
first by Alan Turing [51]) are a computational model inspired by the connectivity
of neurons in animate nervous systems. A further boost to their popularity
came with the proof that they can approximate any function mapping via the
Universal Approximation Theorem [22]. A simple scheme for a neural network is
shown in Figure 5. Each circle denotes a computational element referred to as a
neuron. A neuron computes a weighted sum of its inputs, and possibly performs a
nonlinear function on this sum. If certain classes of nonlinear functions are used,
the function computed by the network can approximate any function (specifically
a mapping from the training patterns to the training targets), provided enough
neurons exist in the network. Common nonlinear functions are the sign function
and the hyperbolic tangent.

The architecture of neural networks is not limited to the feed-forward structure
shown in Figure 5. Many other structures have been suggested, such as
recurrent NN, where the output is fed back as an input to the net, networks
with multiple outputs, and networks where each of the neurons activates only if
the input pattern is in a certain region of the input space (an example of which
are radial-basis function (RBF) networks).

Fig. 5. A schematic diagram of a neural network (inputs, hidden layer, output layer).
Each circle in the hidden and output layer is a computational element known as a
neuron

If a single neuron exists in a network, it is usually referred to as a perceptron.
Perceptrons find a linear separating hyperplane, and it can be proved that
training will converge to a solution, if one exists. There are many algorithms for
training (i.e. finding the weight vector for the perceptron): batch and stochastic,
on-line and off-line, with and without memory [15]. The perceptron is a good
choice for an on-line linear classifier. It shares the same pros and cons as the LS
classifier, with the additional drawback that it might not converge if no linear
separation exists. However, for off-line applications it is usually simpler to use
the LS algorithm.
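
One of the simplest training schemes of this kind is the classical on-line
perceptron rule, sketched below for labels of -1 and +1. The learning rate, the
explicit bias term, and the cap on the number of passes are choices of this
example; the cap is there because, as noted above, training need not converge
when the data are not linearly separable.

```python
def train_perceptron(X, y, epochs=100, lr=1.0):
    """Return (weights, bias) of a separating hyperplane, if one is found."""
    w = [0.0] * len(X[0])
    b = 0.0
    for _ in range(epochs):
        mistakes = 0
        for x, t in zip(X, y):
            activation = sum(wi * xi for wi, xi in zip(w, x)) + b
            if t * activation <= 0:           # misclassified (or on the boundary)
                w = [wi + lr * t * xi for wi, xi in zip(w, x)]
                b += lr * t
                mistakes += 1
        if mistakes == 0:                     # a full pass without errors
            break
    return w, b
```

Each training point is handled on its own, so the same update can be applied to
data arriving one point at a time.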

Multiple-layered NNs are far more difficult to train. Indeed this was a ma- 
jor obstacle in the development of NNs until an efficient algorithm for training 
was developed. This algorithm is known as the backpropagation algorithm, so- 
called because the errors that are the driving force in the training (if there is 
no error, there is no need to change the weights of the NN) are propagated 
from the output layer, through the hidden layers, to the input layer. This algo- 
rithm, whether in batch or stochastic mode, enables the network to be trained 
according to the need. Further advancement was attained through second-order 
methods for training, which achieve faster convergence. Among these we note 
the conjugate-gradient descent (CGD) algorithm [36] and Quickprop [18], both 
of which significantly accelerate network training. 
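
The propagation of errors from the output back towards the input can be sketched for a very small network. The sketch below assumes one hidden layer, tanh neurons, a single output, no bias terms, and a squared-error loss; these choices, the names, and the numbers are assumptions for illustration, not a description of any particular implementation.

```python
import math


def forward(W1, W2, x):
    """W1: one weight row per hidden neuron; W2: hidden-to-output weights."""
    hidden = [math.tanh(sum(w * xi for w, xi in zip(row, x))) for row in W1]
    output = math.tanh(sum(w * h for w, h in zip(W2, hidden)))
    return hidden, output


def loss(W1, W2, x, t):
    return 0.5 * (forward(W1, W2, x)[1] - t) ** 2


def backprop(W1, W2, x, t):
    """Gradient of the loss with respect to W1 and W2 for one pattern."""
    hidden, output = forward(W1, W2, x)
    # Error signal at the output neuron (the derivative of tanh is 1 - tanh^2).
    delta_out = (output - t) * (1.0 - output ** 2)
    grad_W2 = [delta_out * h for h in hidden]
    # The output error is propagated back through W2 to each hidden neuron.
    delta_hidden = [delta_out * w * (1.0 - h ** 2) for w, h in zip(W2, hidden)]
    grad_W1 = [[d * xi for xi in x] for d in delta_hidden]
    return grad_W1, grad_W2


# One stochastic gradient step on a single hypothetical pattern.
W1 = [[0.1, -0.2], [0.4, 0.3]]
W2 = [0.5, -0.6]
x, t = [1.0, 0.5], 1.0
before = loss(W1, W2, x, t)
g1, g2 = backprop(W1, W2, x, t)
rate = 0.1
W1 = [[w - rate * g for w, g in zip(row, grow)] for row, grow in zip(W1, g1)]
W2 = [w - rate * g for w, g in zip(W2, g2)]
after = loss(W1, W2, x, t)
```

If the output already equals the target, `delta_out` is zero and no weight changes, which is the "no error, no change" behaviour described above.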

NNs have significant advantages in memory requirements and classification
speed, and have shown excellent results on real-world problems [26].
Nevertheless, they suffer from major drawbacks. Among these are the difficulty
in deciding on network architecture as well as several other network
parameters, and that the resulting classifier is a "black box", where it is
difficult to understand why the network training resulted in a certain set of
weights. Finally, contrary to other classification algorithms, efficient
training of NNs is also dependent on several "tricks of the trade" such as
normalizing the inputs, setting the initial weight values, etc. This makes it
difficult for the novice to use NNs effectively.

Table 1. Comparison of the reviewed classification algorithms

Algorithm                      Classification  Computational  Memory        Difficulty to  On-line  Insight from
                               error           cost           requirements  implement               the classifier

Expectation-Maximization (EM)  Low             Medium         Small         Low            No       Yes
Nearest neighbor               Medium-low      High           High          Low            No       No
Decision trees                 Medium          Medium         Medium        Low            No       Yes
Parzen windows                 Low             High           High          Low            No       No
Linear least squares (LS)      High            Low            Low           Low            Yes      Yes
Genetic programming            Medium-low      Medium         Low           Low            No       Some
Neural Networks                Low             Medium         Low           High           Yes      No
Ada-Boost                      Low             Medium         Medium        Medium         No       No
Support vector machines (SVM)  Low             Medium         Low           Medium         Yes      Some

An interesting and extremely useful approach to classification is to use simple 
classifiers as building blocks for constructing complicated decision regions. This 
approach is known as Boosting. Schematically, we first train a simple (or weak) 
classifier for the data. Then, those points of the train-set that are incorrectly 
classified are located and another weak classifier is trained so as to improve the 
classification of these incorrectly labeled points. This process is repeated until a 
sufficiently low training error is reached. 

The training of the weak classifiers can be performed by either drawing points
from the training data with a probability inversely proportional to the distance
of the points from the decision region or by selecting a cluster of the
incorrectly classified points. In the first case, the algorithm is known as
AdaBoost [19], which is the most popular boosting algorithm. In the second case,
the algorithm is the Local Boosting algorithm [31].

Boosting has been shown to give very good results on many data-sets. Its
computational cost is reasonably low, as are its memory requirements. Thus,
boosting is one of the most useful classification algorithms. 
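
The boosting loop can be sketched as follows. Note the assumptions: instead of resampling the training set, this sketch uses the common reweighting formulation of AdaBoost (each point carries a weight that grows when the point is misclassified), and the weak classifiers are one-dimensional threshold "stumps". The function names and the toy data are hypothetical.

```python
import math


def train_stump(xs, ys, weights):
    """Return (weighted error, threshold, polarity) of the best 1-D stump."""
    best = None
    for thr in sorted(set(xs)):
        for polarity in (1, -1):
            preds = [polarity if x >= thr else -polarity for x in xs]
            err = sum(w for w, p, y in zip(weights, preds, ys) if p != y)
            if best is None or err < best[0]:
                best = (err, thr, polarity)
    return best


def adaboost(xs, ys, rounds):
    n = len(xs)
    weights = [1.0 / n] * n
    ensemble = []
    for _ in range(rounds):
        err, thr, polarity = train_stump(xs, ys, weights)
        err = max(err, 1e-12)  # guard against division by zero
        if err >= 0.5:         # weak learner no better than chance: stop
            break
        alpha = 0.5 * math.log((1.0 - err) / err)
        ensemble.append((alpha, thr, polarity))
        preds = [polarity if x >= thr else -polarity for x in xs]
        # Increase the weight of misclassified points, decrease the others.
        weights = [w * math.exp(-alpha * y * p)
                   for w, y, p in zip(weights, ys, preds)]
        total = sum(weights)
        weights = [w / total for w in weights]
    return ensemble


def predict(ensemble, x):
    score = sum(a * (pol if x >= thr else -pol) for a, thr, pol in ensemble)
    return 1 if score >= 0 else -1


# Hypothetical data that no single stump can classify correctly.
xs = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
ys = [1, 1, -1, -1, 1, 1]
ensemble = adaboost(xs, ys, rounds=3)
```

On this toy set each individual stump misclassifies at least two points, yet the weighted vote of three stumps classifies all six correctly, which is the "complicated decision regions from simple building blocks" idea in miniature.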

The last type of classification algorithm we discuss in this introduction is the 
Support Vector Machine (SVM) classifier. This classifier is the result of 
seminal work by Boser, Guyon, and Vapnik [5] and later others. SVM draws on 
two main practical observations: 

1. At a sufficiently high dimension, patterns are orthogonal to each other, and 
thus it is easier to find a separating hyperplane for data in a high dimension. 

2. Not all patterns are necessary for finding a separating hyperplane. In fact, 
it is sufficient to use only those points that are near the boundary between 
groups for constructing the boundary. 

An SVM classifier is a linear classifier which finds the hyperplane that
separates the data with the largest possible margin (i.e. the distance between
the hyperplane and the closest data point), built after transforming the data
into a high dimension (known as the feature space). Let us begin with the second
part of the process - the separating hyperplane. A linear separating hyperplane
is a decision function in the form

$$f(x) = \mathrm{sign}(\langle w, x \rangle + b)$$

where $x$ is the input pattern, $w$ is the weight vector, $b$ is a bias term,
and $\langle \cdot, \cdot \rangle$ denotes the inner product.

If the data is to be classified correctly, this hyperplane should ensure that

$$y_i (\langle w, x_i \rangle + b) > 0 \quad \text{for all } i = 1, \ldots, m$$

assuming that $y_i \in \{-1, +1\}$.

There is one separating hyperplane that maximizes the margin separating
the data, which is attractive since this hyperplane gives good generalization
performance [46]. In order to find this hyperplane we need to minimize
$\|w\|^2$.

Thus the SVM problem can be written as:

$$\text{minimize} \quad \tfrac{1}{2}\|w\|^2$$

$$\text{subject to} \quad y_i (\langle w, x_i \rangle + b) \geq 1 \quad \text{for all } i = 1, \ldots, m$$

(The right-hand side of the constraint was changed from zero to one, since
otherwise the minimum of $\|w\|^2$ would be the trivial solution $w = 0$. In
fact, any positive number would suffice.)

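This constrained minimization problem is solved using Lagrange multipliers,
which results in a dual optimization problem:

$$\text{maximize} \quad W(\alpha) = \sum_{i=1}^{m} \alpha_i - \tfrac{1}{2} \sum_{i,j=1}^{m} \alpha_i \alpha_j y_i y_j \langle x_i, x_j \rangle$$

$$\text{s.t.} \quad \alpha_i \geq 0, \quad \sum_{i=1}^{m} \alpha_i y_i = 0$$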
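
A brief sketch of how the dual arises may help here. It assumes the standard hard-margin formulation, with one multiplier $\alpha_i \geq 0$ attached to each constraint $y_i(\langle w, x_i \rangle + b) \geq 1$; the intermediate steps are filled in for illustration and follow the usual textbook derivation.

```latex
% Lagrangian of the primal problem, one multiplier \alpha_i \geq 0 per constraint
L(w, b, \alpha) = \tfrac{1}{2}\|w\|^2
  - \sum_{i=1}^{m} \alpha_i \left[ y_i (\langle w, x_i \rangle + b) - 1 \right]

% Stationarity with respect to the primal variables w and b
\frac{\partial L}{\partial w} = 0 \;\Rightarrow\; w = \sum_{i=1}^{m} \alpha_i y_i x_i ,
\qquad
\frac{\partial L}{\partial b} = 0 \;\Rightarrow\; \sum_{i=1}^{m} \alpha_i y_i = 0

% Substituting both conditions back into L eliminates w and b
W(\alpha) = \sum_{i=1}^{m} \alpha_i
  - \tfrac{1}{2} \sum_{i,j=1}^{m} \alpha_i \alpha_j y_i y_j \langle x_i, x_j \rangle
```

The first stationarity condition also shows that the weight vector is a linear combination of the training patterns, which is why the patterns enter the dual only through inner products.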

The coefficients $\alpha_i$ corresponding to input patterns that are not used
for construction of the class boundary should be zero. The patterns whose
coefficients are non-zero are known as the support vectors. The above
optimization problem can be solved in several ways, for example: through a
perceptron, which finds the largest margin hyperplane separating the data [15];
by use of quadratic programming optimization algorithms, which solve the
optimization problem [15]; or through other efficient optimization algorithms
such as the sequential minimal optimization (SMO) algorithm [40].

Classification of new samples is performed using the equation

$$f(x) = \mathrm{sign}\left( \sum_{i=1}^{m} \alpha_i y_i \langle x, x_i \rangle + b \right)$$
As noted above, it is useful to map the input patterns into a high dimensional
space. This could be done by mapping each input pattern through a function
$\varphi$, so that $\tilde{x} = \varphi(x)$. However, in practice, if the
function maps the data into a very high dimension, it would be problematic to
compute and to store the results of the mapping. If the mapping is done into an
infinite dimensional space this would be impossible. Fortunately, this problem
can be avoided through a substitute known as the kernel trick [5]. Note that in
the optimization problem above, the input patterns only appear in an inner
product of pairs of patterns. Thus, instead of mapping each sample to a higher
dimension and then performing the inner product, it is possible (for certain
classes of kernels) to first compute the inner product between patterns and
only then compute the mapping on a scalar. Thus, in the equations above we now
replace the inner products $\langle x, x' \rangle$ with $k(x, x')$, where $k$ is
the kernel function. The kernel function used for mapping should conform to
conditions known as the Mercer conditions [22]. Examples of such functions are
polynomials, radial basis functions (Gaussian functions), and hyperbolic
tangents.
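
The equivalence behind the kernel trick can be checked numerically on a small case. The sketch below assumes two-dimensional inputs and the homogeneous quadratic kernel $k(x, z) = \langle x, z \rangle^2$; the explicit feature map `phi` and all names are illustrative choices, not part of the original text.

```python
import math


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def phi(x):
    """Explicit feature map whose inner product equals the quadratic kernel."""
    x1, x2 = x
    return (x1 * x1, math.sqrt(2.0) * x1 * x2, x2 * x2)


def quadratic_kernel(x, z):
    """Inner product in input space first, then a scalar mapping (squaring)."""
    return dot(x, z) ** 2


x, z = (1.0, 2.0), (3.0, -1.0)
explicit = dot(phi(x), phi(z))      # map to feature space, then inner product
implicit = quadratic_kernel(x, z)   # kernel evaluated in input space
```

Both routes give the same number, but the kernel route never forms the three-dimensional feature vectors. For higher polynomial degrees or Gaussian kernels the feature space grows very large (or infinite), while the cost of evaluating the kernel stays that of one inner product.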

SVMs have been studied extensively [46]. They have been extended in many 
directions. Some notable examples include: 

1. Cases where the optimal hyperplane does not exist, through the introduction 
of a penalty term which allows some training patterns to be incorrectly 
classified [10]. 

2. Single class learning (outlier detection) [45]. 

3. Online learning [17]. 

4. Feature selection [20, 54]. 

5. Incremental classification so as to reduce the computational cost of SVMs [7]. 

It is difficult to find thorough comparative studies of classification algorithms. 
Several such studies (for example [33, 46]) point to the conclusion that a few 
classification algorithms, namely SVM, AdaBoost, Kernel Fisher discriminant, 
and Neural networks achieve similar results with regard to error rates. Lately, 
the Relevance Vector Machine [49], a kernel method stemming from Bayesian 
learning, has also joined this group of algorithms. However, these algorithms
differ greatly in the other factors outlined at the beginning of this chapter. 

Finally, we note several practical points of importance which one should take 
into account when designing classifiers: 

1. In order to reduce the likelihood of over-fitting the classifier to the
training data, the ratio of the number of training examples to the number of
features should be at least 10:1. For the same reason the ratio of the number
of training examples to the number of unknown parameters should be at least
10:1.

2. It is important to use proper error-estimation methods (see next section),
especially when selecting parameters for the classifier.

3. Some algorithms require the input features to be scaled to similar ranges.
This is especially evident for those that use some kind of a weighted average
of the inputs such as neural networks, SVM, etc.

4. There is no single best classification algorithm! 

Thus far we have implicitly only discussed problems where there are two 
classes of data, i.e. the labels can take one of two values. In many applications 
it is necessary to distinguish between more than two classes. Some classifiers 
are suitable for such applications with only minor changes. Examples of such 
classifiers are the LS classifier, the Nearest Neighbor classifier, and decision trees. 
Neural networks require a minor modification to work with multiclass problems. 
Instead of having a single output neuron there should be as many output neurons 
as labels. Each of the output neurons is trained to respond to data of one class, 
and the strongest activated neuron is taken to be the predicted class label. SVMs 
have been modified to solve multiclass problems through a slight change in the 
objective function to the minimization procedure [46]. 

Not all classifiers are readily modifiable to multiclass applications. The strat- 
egy for solution of such cases is to train several classifiers and add a gating 
network that decides on the predicted label based on the output of these clas- 
sifiers. The simplest example of such a strategy is to train as many classifiers 
as classes where each classifier is trained to respond to one class of data. The 
gating network then outputs the number of the classifier that responded to a 
given input. This type of solution is called a one-against-all solution. The main 
drawbacks of this solution are that it is heuristic, that the classifiers are solving 
problems that are very different in their difficulty, and that, if the output of 
the classifiers is binary, there might be more than one possible class for each 
output. A variation on the one-against-all solution is to train classifiers to
distinguish between each pair of classes [21]. This solution has the advantage
that the individual classifiers are trained on smaller datasets. The main
drawback of this solution is the large number of classifiers that need to be
trained ($N(N-1)/2$).

An elegant solution to multiclass problems was suggested in [13]. That ar- 
ticle showed the parallel between multiclass problems and the study of error- 
correcting codes for communication applications. In the latter, bits of data are 
sent over a noisy channel. At the receiver, the data is reconstructed through 
thresholding of the received bit. In order to reduce the probability of error, ad- 
ditional bits of data are sent to the receiver. These bits are a function of the 
data bits, and are designed so that they can correct errors that occurred
during transmission (if only a small number of errors occurred). The functions
by which these extra bits are computed are known as error-correcting codes. The
application of error-correcting codes to multiclass problems is straightforward.
Classifiers are trained according to an error-correcting code, and their output
to test patterns is interpreted as though they were the received bits of
information. This solution requires the addition of classifiers according to the
specific error-correcting code (for example, the simple Hamming code requires
$2^N - N - 1$ classifiers for $N$ classes of data), but if a few of the
classifiers are in error, the total output would still be correct.

In practice, the one-against-all method is usually not much worse than the 
more sophisticated approaches described above. When selecting the solution, 
one should consider training times and the available memory, in addition to the 
overall accuracy of the system. 

The last topic we address in the context of classification is one-class learning. 
This is an interesting philosophical subject, as well as an extremely practical 
one. We usually learn by observing different examples (Car vs. Plane, Cat vs. 
Dog, etc). Is it possible to learn by observing examples of only a single class? 
(e.g. would a child who only saw cats be able to say that a dog, first seen, is not 
a cat?). In the framework of classification, the object of single-class learning
is to distinguish between objects of one kind (the target object) and all other
possible objects [48], where the latter are not seen during training.
Single-class learning has been applied to problems such as image retrieval [9],
typist identification [38], and character recognition [46].

The idea behind single-class learning is to identify areas where the data rep- 
resenting the target object is of high density. If a test sample appears close to 
(or inside) such a high-density area, it would be classified as a target object. If 
it is in a low-density area of the input space, it would be classified as a different 
object. 

The simplest type of single-class algorithm describes the data by a single
Gaussian (with a mean and a covariance matrix). The probability estimate that
a test sample is drawn from this Gaussian is computed, and this measure is
reported to the user. More sophisticated measures are obtained through Parzen
windows estimation of the density, or through the use of a multi-Gaussian
model, with EM used for training. Neural networks have also been used for this
task, by training the network to form closed decision surfaces and labeling
points outside these surfaces as non-target-class data [34].
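
The single-Gaussian approach can be sketched in one dimension, where the covariance matrix reduces to a variance. The acceptance threshold used here (the lowest log-density seen on the training samples) is an assumption made for the example; the text only states that the density estimate is reported. All names and data are hypothetical.

```python
import math


def fit_gaussian(samples):
    """Return the sample mean and the unbiased sample variance."""
    n = len(samples)
    mean = sum(samples) / n
    var = sum((s - mean) ** 2 for s in samples) / (n - 1)
    return mean, var


def log_density(x, mean, var):
    return -0.5 * (math.log(2.0 * math.pi * var) + (x - mean) ** 2 / var)


def is_target(x, mean, var, threshold):
    """Accept x as a target-class object if it lies in a high-density area."""
    return log_density(x, mean, var) >= threshold


# Training uses target-class examples only.
samples = [4.8, 5.1, 5.0, 4.9, 5.2]
mean, var = fit_gaussian(samples)
threshold = min(log_density(s, mean, var) for s in samples)
```

A test sample such as 5.05 falls inside the high-density region and is accepted, while a sample such as 9.0 has a far lower density than any training point and is rejected.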

More recently, single-class SVMs were developed [47]. These are a modification 
of the two-class SVM described above, with the SVM attempting to enclose the 
data with a sphere in feature space. Any data falling outside this sphere is 
deemed not to be of the target class. 



6 Error Estimation Techniques 

As noted above, the most important factor in the performance of a classifier is 
its error rate. This measure is important for assessing if the classifier is useful, 
for tuning its parameters [35], and in order to compare it to other classifiers. It is 
often difficult to estimate the error rate of a given classifier even if full
knowledge of the underlying distribution is available. 

In practice, it is desirable to estimate the error rate given a sample data set. 
This problem is aggravated if the dataset is small [23]. If the whole dataset is
used both for training the classifier and for estimating its error, there is a
serious danger of over-fitting the classifier to the training data (in the
extreme case, consider a 1-Nearest Neighbor classifier). Therefore, the data
should be split into training data and testing data.

There are three main methods for splitting the data: 

1. Resubstitution: Here the whole dataset is used for both training and testing. 
As noted above this method is extremely optimistic. In practice, for small 
datasets error estimation obtained using this method is erroneous. 

2. Holdout: Part of the data (for example, 80%) is used for training, and the
remainder is used for testing. This method is pessimistically biased, and
different splits of the data will result in different error rates.

3. Cross-validation: The data is divided into N equal sub-sets. The classifier
is trained using N-1 of the sub-sets, and tested on the remaining sub-set. The
process is repeated until each of the N sub-sets has been used as a test set.
The error rate is the average of the N resulting errors. The resulting error
rate has a lower bias than the holdout method. An extreme form of
cross-validation is known as leave-one-out, where each sub-set contains a single
point. The estimate of leave-one-out is unbiased but it has a large variance and
is computationally expensive to compute.
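
The cross-validation procedure, and its contrast with resubstitution, can be sketched as follows. The interleaved assignment of points to folds, the 1-Nearest Neighbor classifier, and the toy data are assumptions chosen for the example; in practice the data would usually be shuffled before splitting.

```python
def k_fold_indices(n_samples, n_folds):
    """Yield (train_indices, test_indices) for each of the n_folds sub-sets."""
    folds = [list(range(i, n_samples, n_folds)) for i in range(n_folds)]
    for i, test_idx in enumerate(folds):
        train_idx = [j for k, fold in enumerate(folds) if k != i for j in fold]
        yield train_idx, test_idx


def cross_validation_error(patterns, targets, train, classify, n_folds):
    errors = []
    for train_idx, test_idx in k_fold_indices(len(patterns), n_folds):
        model = train([patterns[i] for i in train_idx],
                      [targets[i] for i in train_idx])
        wrong = sum(classify(model, patterns[i]) != targets[i] for i in test_idx)
        errors.append(wrong / len(test_idx))
    return sum(errors) / len(errors)  # average of the per-fold error rates


def train_1nn(patterns, targets):
    return list(zip(patterns, targets))  # 1-NN simply stores the training set


def classify_1nn(model, x):
    return min(model, key=lambda pair: abs(pair[0] - x))[1]


# Hypothetical noisy one-dimensional data.
patterns = [0.0, 0.9, 2.1, 2.8, 4.2, 5.0]
targets = [-1, -1, 1, -1, 1, 1]

cv_error = cross_validation_error(patterns, targets, train_1nn, classify_1nn, 3)

# Resubstitution: train and test on the whole dataset.
model = train_1nn(patterns, targets)
resub_error = sum(classify_1nn(model, x) != y
                  for x, y in zip(patterns, targets)) / len(patterns)
```

For the 1-Nearest Neighbor classifier the resubstitution error is zero by construction (every point is its own nearest neighbor), whereas the cross-validated estimate is not, which illustrates why resubstitution is described as extremely optimistic. Setting `n_folds` to the number of samples gives leave-one-out.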

After computing the error rate of a classifier, we have an estimate of how
well the algorithm will perform on new data. The algorithm should then be
trained using the whole dataset, in preparation for new data.

Although it is desirable to use the error rate as a way to compare the per- 
formance of different classification algorithms, this is (surprisingly) still an open 
issue for future research. Some researchers have used the Wilcoxon signed-rank
test for such comparisons, although the underlying assumptions of this test are
violated when it is used in this way [14].



7 Summary 

The purpose of pattern classification algorithms is to automatically construct 
methods for distinguishing between different exemplars, based on their differen- 
tiating patterns. 

The goal of completely automated learning algorithms is yet to be attained. 
Most pattern classification algorithms need some manual parameter tuning to 
achieve the best possible performance. More importantly, in most practical appli- 
cations, domain knowledge remains crucial for the successful operation of Pattern 
Classification algorithms. 

Pattern classification has been an object of research for several decades. In 
the past decade this research resulted in a multitude of new algorithms, better 
theoretic understanding of previous ideas, as well as many successful practical 
applications. 

Pattern classification remains an exciting domain for theoretic research, as 
well as for application of its tools to practical problems. Some of the problems 
yet to be solved were outlined in previous paragraphs, and a more detailed list 
appears in [16]. 

Abstract. This chapter describes Lagrange multipliers and some 
selected subtopics from matrix analysis from a machine learning per- 
spective. The goal is to give a detailed description of a number of math- 
ematical constructions that are widely used in applied machine learning. 



1 Introduction 

The topics discussed in this chapter are ones that I felt are often assumed in ap- 
plied machine learning (and elsewhere), but that are seldom explained in detail. 
This work is aimed at the student who’s taken some coursework in linear meth- 
ods and analysis, but who’d like to see some of the tricks used by researchers 
discussed in a little more detail. The mathematics described here is a small 
fraction of that used in machine learning in general (a treatment of machine 
learning theory would include the mathematics underlying generalization error 
bounds, for example)¹, although it’s a largely self-contained selection, in that 
derived results are often used downstream. I include two kinds of homework, 
‘exercises’ and ‘puzzles’. Exercises start out easy, and are otherwise as you’d 
expect; the puzzles are exercises with an added dose of mildly Machiavellian 
mischief. 

Notation: vectors appear in bold font, and vector components and matrices 
in normal font, so that for example $v_i^{(a)}$ denotes the $i$’th component of
the $a$’th vector $\mathbf{v}^{(a)}$. The symbol $A \succ 0$ ($\succeq$) means
that the matrix $A$ is positive (semi)definite. The transpose of the matrix $A$
is denoted $A^T$, while that of the vector $\mathbf{x}$ is denoted
$\mathbf{x}'$.



2 Lagrange Multipliers 

Lagrange multipliers are a mathematical incarnation of one of the pillars of 
diplomacy (see the historical notes at the end of this section): sometimes an 
indirect approach will work beautifully when the direct approach fails. 



¹ My original lectures also contained material on functional analysis and convex
optimization, which is not included here.



O. Bousquet et al. (Eds.): Machine Learning 2003, LNAI 3176, pp. 21–40, 2004. 

2.1 One Equality Constraint 

Suppose that you wish to minimize some function $f(\mathbf{x})$,
$\mathbf{x} \in \mathbb{R}^d$, subject to the constraint $c(\mathbf{x}) = 0$. A
direct approach is to find a parameterization of the constraint such that $f$,
expressed in terms of those parameters, becomes an unconstrained function. For
example, if $c(\mathbf{x}) = \mathbf{x}'A\mathbf{x} - 1$,
$\mathbf{x} \in \mathbb{R}^d$, and if $A \succ 0$, you could rotate to a
coordinate system and rescale to diagonalize the constraints to the form
$\mathbf{y}'\mathbf{y} = 1$, and then substitute with a parameterization that
encodes the constraint that $\mathbf{y}$ lives on the $(d-1)$-sphere, for
example

$$y_1 = \sin\theta_1 \sin\theta_2 \cdots \sin\theta_{d-2} \sin\theta_{d-1}$$
$$y_2 = \sin\theta_1 \sin\theta_2 \cdots \sin\theta_{d-2} \cos\theta_{d-1}$$
$$y_3 = \sin\theta_1 \sin\theta_2 \cdots \cos\theta_{d-2}$$
$$\vdots$$



Unfortunately, for general constraints (for example, when c is a general poly- 
nomial in the d variables) this is not possible, and even when it is, the above 
example shows that things can get complicated quickly. The geometry of the 
general situation is shown schematically in Figure 1. 





Fig. 1. At the constrained optimum, the gradient of the constraint must be parallel to 
that of the function 

On the left, the gradient of the constraint is not parallel to that of the
function; it’s therefore possible to move along the constraint surface (thick
arrow) so as to further reduce $f$. On the right, the two gradients are
parallel, and any motion along $c(\mathbf{x}) = 0$ will increase $f$, or leave
it unchanged. Hence, at the solution, we must have
$\nabla f = \lambda \nabla c$ for some constant $\lambda$; $\lambda$ is called
an (undetermined) Lagrange multiplier, where ‘undetermined’ arises from the fact
that for some problems, the value of $\lambda$ itself need never be computed.

2.2 Multiple Equality Constraints 

How does this extend to multiple equality constraints, $c_i(\mathbf{x}) = 0$,
$i = 1, \ldots, n$? Let $\mathbf{g}_i \equiv \nabla c_i$. At any solution
$\mathbf{x}_*$, it must be true that the gradient of $f$ has no components that
are perpendicular to all of the $\mathbf{g}_i$, because otherwise you could move
$\mathbf{x}_*$ a little in that direction (or in the opposite direction) to
increase (decrease) $f$ without changing any of the $c_i$, i.e. without
violating any constraints. Hence for multiple equality constraints, it must be
true that at the solution $\mathbf{x}_*$, the space spanned by the
$\mathbf{g}_i$ contains the vector $\nabla f$, i.e. there are some constants
$\lambda_i$ such that
$\nabla f(\mathbf{x}_*) = \sum_i \lambda_i \mathbf{g}_i(\mathbf{x}_*)$. Note
that this is not sufficient, however - we also need to impose that the solution
is on the correct constraint surface (i.e. $c_i = 0\ \forall i$). A neat way to
encapsulate this is to introduce the Lagrangian
$L = f(\mathbf{x}) - \sum_i \lambda_i c_i(\mathbf{x})$, whose gradient with
respect to the $\mathbf{x}$, and with respect to all the $\lambda_i$, vanishes
at the solution.
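
A small worked example may make the recipe concrete. The function and constraint below are chosen here purely for illustration: minimize $f(x, y) = x^2 + y^2$ subject to the single equality constraint $c(x, y) = x + y - 1 = 0$.

```latex
% Lagrangian with one multiplier
L = x^2 + y^2 - \lambda (x + y - 1)

% Gradient with respect to x, y and \lambda must vanish
\frac{\partial L}{\partial x} = 2x - \lambda = 0, \qquad
\frac{\partial L}{\partial y} = 2y - \lambda = 0, \qquad
\frac{\partial L}{\partial \lambda} = -(x + y - 1) = 0

% Solution
x = y = \tfrac{1}{2}, \qquad \lambda = 1

% Check: the gradients are parallel at the solution
\nabla f = (2x, 2y) = (1, 1) = \lambda \nabla c, \qquad \nabla c = (1, 1)
```

The derivative with respect to $\lambda$ simply returns the constraint, which is how the Lagrangian enforces that the solution lies on the constraint surface.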

Puzzle 1: 



Exercise 1. 


Exercise 2. 



2.3 Inequality Constraints 

Suppose that instead of the constraint $c(\mathbf{x}) = 0$ we have the single
constraint $c(\mathbf{x}) \leq 0$. Now the entire region labeled
$c(\mathbf{x}) < 0$ in Figure 1 has become feasible. At the solution, if the
constraint is active ($c(\mathbf{x}) = 0$), we again must have that $\nabla f$
is parallel to $\nabla c$, by the same argument. In fact we have a stronger
condition, namely that if the Lagrangian is written $L = f + \lambda c$, then
since we are minimizing $f$, we must have $\lambda \geq 0$, since the two
gradients must point in opposite directions (otherwise a move away from the
surface $c = 0$ and into the feasible region would further reduce $f$). Thus for
an inequality constraint, the sign of $\lambda$ matters, and so here
$\lambda \geq 0$ itself becomes a constraint (it’s useful to remember that if
you’re minimizing, and you write your Lagrangian with the multiplier appearing
with a positive coefficient, then the constraint is $\lambda \geq 0$). If the
constraint is not active, then at the solution $\nabla f(\mathbf{x}_*) = 0$, and
if $\nabla c(\mathbf{x}_*) \neq 0$, then in order that
$\nabla L(\mathbf{x}_*) = 0$ we must set $\lambda = 0$ (and if in fact
$\nabla c(\mathbf{x}_*) = 0$, we can still set $\lambda = 0$). Thus in either
case (active or inactive), we can find the solution by requiring that the
gradients of the Lagrangian vanish, and we also have
$\lambda c(\mathbf{x}_*) = 0$. This latter condition is one of the important
Karush-Kuhn-Tucker conditions of convex optimization theory [15, 4], and can
facilitate the search for the solution, as the next exercise shows.

For multiple inequality constraints, again at the solution $\nabla f$ must lie
in the space spanned by the $\nabla c_i$, and again if the Lagrangian is
$L = f + \sum_i \lambda_i c_i$, then we must in addition have
$\lambda_i \geq 0\ \forall i$ (since otherwise $f$ could be reduced by moving
into the feasible region); and for inactive constraints, again we (can, usually
must, and so might as well) set $\lambda_i = 0$. Thus the above KKT condition
generalizes to $\lambda_i c_i(\mathbf{x}_*) = 0\ \forall i$. Finally, a simple
and often useful trick is to solve ignoring one or more of the constraints, and
then check that the solution satisfies those constraints, in which case you have
solved the problem; we’ll call this the free constraint gambit below.
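
The active and inactive cases can be seen side by side in a one-dimensional example, chosen here for illustration: minimize $f(x) = (x - 2)^2$ subject to an inequality constraint $c(x) \leq 0$, with the Lagrangian written as $L = f + \lambda c$.

```latex
% Case 1: c(x) = x - 1 \leq 0.  Ignoring the constraint gives x = 2, which
% violates it (c = 1 > 0), so the constraint must be active: x_* = 1.
\frac{dL}{dx} = 2(x_* - 2) + \lambda = 0 \;\Rightarrow\; \lambda = 2 \geq 0,
\qquad \lambda\, c(x_*) = 2 \cdot 0 = 0

% Case 2: c(x) = x - 3 \leq 0.  Ignoring the constraint gives x_* = 2, which
% satisfies it (c = -1 < 0), so the constraint is inactive and \lambda = 0.
\frac{dL}{dx} = 2(x_* - 2) + 0 = 0,
\qquad \lambda\, c(x_*) = 0 \cdot (-1) = 0
```

In both cases $\lambda \geq 0$ and $\lambda c(x_*) = 0$ hold; in the second case, solving while ignoring the constraint and then checking it is exactly the trick described in the last sentence above.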

Exercise 3. = ^ 

X e 7^‘^ , ^ Xi > 0 C . 

XiX^i — 0 ^ 

2.4 Cost Benefit Curves 

Here’s an example from channel coding. Suppose that you are in charge of four
fiber optic communications systems. As you pump more bits down a given channel,
the error rate increases for that channel, but this behavior is slightly
different for each channel. Figure 2 shows a graph of the bit rate for each
channel versus the ‘distortion’ (error rate). Your goal is to send the maximum
possible number of bits per second at a given, fixed total distortion rate $D$.
Let $D_i$ be the number of errored bits sent down the $i$’th channel. Given a
particular error rate, we’d like to find the maximum overall bit rate; that is,
we must maximize the total rate $R = \sum_{i=1}^{4} R_i$ subject to the
constraint $D = \sum_{i=1}^{4} D_i$. Introducing a Lagrange multiplier
$\lambda$, we wish to maximize the objective function

$$L = \sum_{i=1}^{4} R_i(D_i) + \lambda \left( D - \sum_{i=1}^{4} D_i \right) \qquad (1)$$

Fig. 2. Total bit rate versus distortion for each system

Setting $\partial L / \partial D_i = 0$ gives
$\partial R_i / \partial D_i = \lambda$, that is, each fiber should be operated
at a point on its rate/distortion curve such that its slope is the same for all
fibers. Thus we’ve found the general rule for resource allocation, for
benefit/cost curves like those shown² in Figure 2: whatever operating point is
chosen for each system, in order to maximize the benefit at a given cost, the
slope of the graph at that point should be the same for each curve. For the
example shown, the slope of each graph decreases monotonically, and we can start
by choosing a single large value of the slope $\lambda$ for all curves, and
decrease it until the condition $\sum_{i=1}^{4} D_i = D$ is met, so in general
for $m$ fibers, an $m$ dimensional search problem has been reduced to a one
dimensional search problem. We can get the same result informally as follows:
suppose you had just two fibers, and were at an operating point where the slope
$s_1$ of the rate/distortion graph for fiber 1 was greater than the slope $s_2$
for fiber 2. Suppose you then adjusted things so that fiber 1 sent one more
errored bit every second, and fiber 2 sent one fewer. The extra number of bits
you can now send down fiber 1 more than offsets the fewer number of bits you
must send down fiber 2. This will hold whenever the slopes are different. For an
arbitrary number of fibers, we can apply this argument to any pair of fibers, so
the optimal point is for all fibers to be operating at the same slope.
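
The one dimensional search over the common slope can be sketched numerically. The rate curves used here, $R_i(D_i) = a_i \log(1 + D_i)$, are hypothetical concave curves chosen only so that the slope $a_i / (1 + D_i)$ can be inverted in closed form, and bisection replaces the "start large and decrease" search described above; all names are assumptions.

```python
def distortion_at_slope(gain, slope):
    """Operating point D_i at which the curve gain*log(1 + D_i) has this slope."""
    return max(gain / slope - 1.0, 0.0)


def allocate(gains, total_distortion, tol=1e-12):
    """Find the common slope at which the distortions sum to the budget."""
    lo, hi = 1e-9, max(gains)  # at slope max(gains) every D_i is zero
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        used = sum(distortion_at_slope(a, mid) for a in gains)
        if used > total_distortion:
            lo = mid   # too much distortion: the common slope must be larger
        else:
            hi = mid
    slope = 0.5 * (lo + hi)
    return slope, [distortion_at_slope(a, slope) for a in gains]


# Four hypothetical fibers and a total distortion budget D = 16.
gains = [1.0, 2.0, 3.0, 4.0]
slope, distortions = allocate(gains, total_distortion=16.0)
slopes_at_solution = [a / (1.0 + d) for a, d in zip(gains, distortions)]
```

The search runs over a single number regardless of how many fibers there are, and at the returned operating points every curve has (numerically) the same slope.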



Puzzle 2: 









2.5 An Isoperimetric Problem 

Isoperimetric problems - problems for which a quantity is extremized while a 
perimeter is held fixed - were considered in ancient times, but serious work 
on them began only towards the end of the seventeenth century, with a minor 
battle between the Bernoulli brothers [14]. It is a fitting example for us, since the 
general isoperimetric problem had been discussed for fifty years before Lagrange 
solved it in his first venture into mathematics [1], and it provides an introduction 
to functional derivatives, which we’ll need. Let’s consider a classic isoperimetric 
problem: to find the plane figure with maximum area, given fixed perimeter. 
Consider a curve with fixed endpoints $\{x = 0, y = 0\}$ and
$\{x = 1, y = 0\}$, and fixed length $\rho$. We will assume that the curve
defines a function, that is, that for a given $x \in [0, 1]$, there corresponds
just one $y$. We wish to maximize the area between the curve and the $x$ axis,
$A = \int_0^1 y\,dx$, subject to the constraint that the length,
$\rho = \int_0^1 \sqrt{1 + y'^2}\,dx$, is fixed (here, prime denotes
differentiation with respect to $x$). The Lagrangian is therefore

$$L = \int_0^1 y\,dx + \lambda \left( \int_0^1 \sqrt{1 + y'^2}\,dx - \rho \right) \qquad (2)$$



² This seemingly innocuous statement is actually a hint for the puzzle that follows.

Two new properties of the problem appear here: first, integrals appear in
the Lagrangian, and second, we are looking for a solution which is a function,
not a point. To solve this we will use the calculus of variations, introduced by
Lagrange and Euler. Denote a small variation of a function³ $f$ by $\delta f$:
that is, replace $f(x)$ by $f(x) + \delta f(x)$ everywhere, where $\delta f$ is
chosen to vanish at the boundaries, that is, $\delta f(0) = \delta f(1) = 0$
(note that $\delta f$ is also a function of $x$). Here, $y$ is the variable
function, so the change in $L$ is

$$\delta L = \int_0^1 \delta y\,dx + \lambda \int_0^1 (1 + y'^2)^{-1/2}\, y'\, \delta y'\,dx$$

By using the facts that $\delta y' = \delta \frac{dy}{dx} = \frac{d}{dx} \delta y$
and that the variation in $y$ vanishes at the endpoints, integrating by parts
then gives:

$$\delta L = \int_0^1 \left( 1 - \lambda y'' (1 + y'^2)^{-3/2} \right) \delta y\,dx$$

$$\Rightarrow \quad 1 - \lambda y'' (1 + y'^2)^{-3/2} = 1 - \lambda \kappa = 0$$

where $\kappa$ is the local curvature, and where the second step results from
our being able to choose $\delta y$ arbitrarily on $(0, 1)$, so the quantity
multiplying $\delta y$ in the integrand must vanish (imagine choosing
$\delta y$ to be zero everywhere except over an arbitrarily small interval
around some point $x \in [0, 1]$). Since the only plane curves with constant
curvature are the straight line and the arc of circle, we find the result (which
holds even if the diameter of the circle is greater than one). Note that, as
often happens in physical problems, $\lambda$ here has a physical interpretation
(as the inverse curvature); $\lambda$ is always the ratio of the norms of
$\nabla f$ and $\nabla c$ at the solution, and in this sense the size of
$\lambda$ measures the influence of the constraint on the solution.



2.6 Which Univariate Distribution has Maximum Entropy? 

Here we use differential entropy, with the understanding that the bin width 
is sufficiently small that the usual sums can be approximated by integrals, but 
fixed, so that comparing the differential entropy of two distributions is equivalent 
to comparing their entropies. We wish to find the function $f$ that minimizes

$$\int_{-\infty}^{\infty} f(x) \log_2 f(x)\,dx, \quad x \in \mathbb{R} \qquad (3)$$

subject to the four constraints

$$f(x) \geq 0\ \forall x, \qquad \int_{-\infty}^{\infty} f(x)\,dx = 1, \qquad \int_{-\infty}^{\infty} x f(x)\,dx = c_1, \qquad \int_{-\infty}^{\infty} x^2 f(x)\,dx = c_2$$

³ In fact Lagrange first suggested the use of the symbol $\delta$ to denote the
variation of a whole function, rather than that at a point, in 1755 [14].

Note that the last two constraints, which specify the first and second
moments, are equivalent to specifying the mean and variance. Our Lagrangian is
therefore:

$$L = \int_{-\infty}^{\infty} f(x) \log_2 f(x)\,dx + \lambda \left( 1 - \int_{-\infty}^{\infty} f(x)\,dx \right) + \beta_1 \left( c_1 - \int_{-\infty}^{\infty} x f(x)\,dx \right) + \beta_2 \left( c_2 - \int_{-\infty}^{\infty} x^2 f(x)\,dx \right)$$



where we’ll try the free constraint gambit and skip the positivity constraint. In
this problem we again need the calculus of variations. In modern terms we use
the functional derivative, which is just a shorthand for capturing the rules of the
calculus of variations, one of which is:

$$\frac{\delta g(x)}{\delta g(y)} = \delta(x - y) \qquad (4)$$



where the right hand side is the Dirac delta function. Taking the functional
derivative of the Lagrangian with respect to $f(y)$ and integrating with respect
to $x$ then gives

$$\log_2 f(y) + \log_2(e) - \lambda - \beta_1 y - \beta_2 y^2 = 0 \qquad (5)$$

which shows that $f$ must have the functional form

$$f(y) = C\exp(\beta_1 y + \beta_2 y^2) \qquad (6)$$

where $C$ is a constant (a factor of $\ln 2$ has been absorbed into $\beta_1$ and $\beta_2$).
The values for the Lagrange multipliers $\lambda$, $\beta_1$ and $\beta_2$
then follow from the three equality constraints above, giving the result that the
Gaussian is the desired distribution. Finally, choosing $C > 0$ makes the result
positive everywhere, so the free constraint gambit worked.
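
As a quick numerical sanity check of this result, the sketch below compares the differential entropy (in bits) of three distributions that share the same variance. The closed-form entropies for the uniform and Laplace distributions are standard results that are not derived in the text, and the function names are illustrative only.

```python
import math

def gaussian_entropy_bits(var):
    # Differential entropy of a Gaussian: 0.5 * log2(2 * pi * e * var)
    return 0.5 * math.log2(2 * math.pi * math.e * var)

def uniform_entropy_bits(var):
    # Uniform on [-C, C] has variance C^2 / 3 and entropy log2(2C)
    c = math.sqrt(3 * var)
    return math.log2(2 * c)

def laplace_entropy_bits(var):
    # Laplace with scale b has variance 2 * b^2 and entropy log2(2 * e * b)
    b = math.sqrt(var / 2)
    return math.log2(2 * math.e * b)

var = 1.0
h_gauss = gaussian_entropy_bits(var)
h_unif = uniform_entropy_bits(var)
h_lap = laplace_entropy_bits(var)
print(h_gauss, h_lap, h_unif)
```

With the variance held fixed, the Gaussian comes out with the largest entropy of the three, as the derivation predicts.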

Puzzle 3: $P_i = 1/N \;\forall i$



Exercise 4. 



Puzzle 4: 

$[-C, C]$.

$$h(P_U) = -\int_{-C}^{C} \frac{1}{2C}\log_2\left(\frac{1}{2C}\right)dx = -\log_2\left(\frac{1}{2C}\right) \qquad (7)$$



h 
2.7 Maximum Entropy with Linear Constraints

Suppose that you have a discrete probability distribution $P_i$, $\sum_i^n P_i = 1$, and
suppose further that the only information that you have about the distribution
is that it must satisfy a set of linear constraints:

$$\sum_i \alpha_{ji} P_i = C_j, \quad j = 1, \dots, m \qquad (8)$$

The maximum entropy approach (see [5], for example) posits that, subject

to the known constraints, our uncertainty about the set of events described by 
the distribution should be as large as possible, or specifically, that the mean 
number of bits required to describe an event generated from the constrained 
probability distribution be as large as possible. Maximum entropy provides a 
principled way to encode our uncertainty in a model, and it is the precursor 
to modern Bayesian techniques [13]. Since the mean number of bits is just the 
entropy of the distribution, we wish to find that distribution that maximizes^ 

$$-\sum_i P_i\log_2 P_i + \sum_j \lambda_j\Big(C_j - \sum_i \alpha_{ji}P_i\Big) + \mu\Big(\sum_i P_i - 1\Big) - \sum_i \delta_i P_i \qquad (9)$$

where the sum constraint on the $P_i$ is imposed with $\mu$, and the positivity of each
$P_i$ with $\delta_i$ (so $\delta_i \ge 0$ and at the maximum, $\delta_i P_i = 0 \;\forall i$)^. Differentiating with
respect to $P_k$ gives

$$P_k = \exp\Big(-1 + \mu - \delta_k - \sum_j \lambda_j\alpha_{jk}\Big) \qquad (10)$$

Since this is guaranteed to be positive we have $\delta_k = 0 \;\forall k$. Imposing the sum
constraint then gives $P_k = \frac{1}{Z}\exp(-\sum_j \lambda_j\alpha_{jk})$ where the “partition function” $Z$
is just a normalizing factor. Note that the Lagrange multipliers have shown us the
form that the solution must take, but that form does not automatically satisfy
the constraints - they must still be imposed as a condition on the solution. The
problem of maximizing the entropy subject to linear constraints therefore gives
the widely used logistic regression model, where the parameters of the model
are the Lagrange multipliers $\lambda_j$, which are themselves constrained by Eq. (8).
For an example from the document classification task of how imposing linear 
constraints on the probabilities can arise in practice, see [16]. 
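
To make the last point concrete (the exponential form must still be fitted to the constraints), here is a minimal sketch for a single linear constraint: a six-sided die whose mean is required to be 4.5. The example, the bisection solver, and all names are assumptions chosen for illustration and do not come from the text.

```python
import math

def maxent_distribution(values, target_mean, lo=-50.0, hi=50.0, iters=200):
    """Find lam so that P_k = exp(-lam * values[k]) / Z has the requested mean."""
    def dist(lam):
        weights = [math.exp(-lam * v) for v in values]
        z = sum(weights)  # the partition function Z
        return [w / z for w in weights]

    def mean(lam):
        return sum(p * v for p, v in zip(dist(lam), values))

    # The mean decreases monotonically as lam grows, so bisection is enough.
    for _ in range(iters):
        mid = 0.5 * (lo + hi)
        if mean(mid) > target_mean:
            lo = mid
        else:
            hi = mid
    lam = 0.5 * (lo + hi)
    return lam, dist(lam)

faces = [1, 2, 3, 4, 5, 6]
lam, probs = maxent_distribution(faces, 4.5)
print(lam, probs)
```

The multiplier is not free: it is pinned down by the mean constraint, which is the sense in which the multipliers are "themselves constrained by Eq. (8)".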

2.8 Some Algorithm Examples 

Lagrange multipliers are ubiquitous for imposing constraints in algorithms. Here 
we list their use in a few modern machine learning algorithms; in all of these ap- 
plications, the free constraint gambit proves useful. For support vector machines, 
the Lagrange multipliers have a physical force interpretation, and can be used to 



^ The factor log2 e can be absorbed into the Lagrange multipliers. 
® Actually the free constraint gambit would work here, too. 
find the exact solution to the problem of separating points in a symmetric simplex
in arbitrary dimensions [6]. For the remaining algorithms mentioned here,
see [7] for details on the underlying mathematics. In showing that the principal
PCA directions give minimal reconstruction error, one requires that the projection
directions being sought are orthogonal, and this can be imposed by
introducing a matrix of multipliers. In locally linear embedding [17], the translation
invariance constraint is imposed for each local patch by a multiplier, and
the constraint that a solution matrix in the reconstruction algorithm be orthogonal
is again imposed by a matrix of multipliers. In the Laplacian eigenmaps
dimensional reduction algorithm [2], in order to prevent the collapse to trivial
solutions, the dimension of the target space is enforced to be $d > 0$ by requiring
that the rank of the projected data matrix be $d$, and again this is imposed using a
matrix of Lagrange multipliers.

Joseph Louis Lagrange was born in 1736 in Turin. He was one 
of only two of eleven siblings to survive infancy; he spent most of his life in Turin, 
Berlin and Paris. He started teaching in Turin, where he organized a research 
society, and was apparently responsible for much fine mathematics that was 
published from that society under the names of other mathematicians [3, 1]. He 



’ [3]®. His contributions lay in the subjects of mechanics, 
calculus^, the calculus of variations®, astronomy, probability, group theory, and 
number theory [14]. Lagrange is at least partly responsible for the choice of base 
10 for the metric system, rather than 12. He was supported academically by Euler 
and d’Alembert, financed by Frederick and Louis XVI, and was close to Lavoisier
(who saved him from being arrested and having his property confiscated, as
a foreigner living in Paris during the Revolution), Marie Antoinette and the
Abbé Marie. He survived the Revolution, although Lavoisier did not. His work
continued to be fruitful until his death in 1813, in Paris. 



3 Some Notes on Matrices 

This section touches on some useful results in the theory of matrices that are
rarely emphasized in coursework. For a complete treatment, see for example [12]
and [11]. Following [12], the set of $p$ by $q$ matrices is denoted $M_{pq}$, the set of
(square) $p$ by $p$ matrices by $M_p$, and the set of symmetric $p$ by $p$ matrices by
$S_p$. We work only with real matrices - the generalization of the results to the
complex field is straightforward. In this section only, we will use the notation
in which repeated indices are assumed to be summed over, so that for example



° Sadly, at that time there were very few female mathematicians. 

For example he was the first to state Taylor’s theorem with a remainder [14]. 

® . . . with which he started his career, in a letter to Euler, who then generously delayed
publication of some similar work so that Lagrange could have time to finish his work
[1].







C.J.C. Burges 



$A_{ij}B_{jk}C_{ki}$ is written as shorthand for $\sum_{i,j,k} A_{ij}B_{jk}C_{ki}$. Let’s warm up with some
basic facts.



3.1 A Dual Basis 

Suppose you are given a basis of $d$ orthonormal vectors $e^{(a)} \in \mathbb{R}^d$, $a = 1, \dots, d$,
and you construct a matrix $E \in M_d$ whose columns are those vectors. It is a
striking fact that the rows of $E$ then also always form an orthonormal basis. We
can see this as follows. Let the $e^{(a)}$ have components $e^{(a)}_i$, $i = 1, \dots, d$. Let’s
write the vectors constructed from the rows of $E$ as $\hat{e}^{(i)}$ so that $\hat{e}^{(i)}_a \equiv e^{(a)}_i$. Then
orthonormality of the columns can be encapsulated as $E^TE = 1$. However since
$E$ has full rank, it has an inverse, and $E^TEE^{-1} = E^{-1} = E^T$, so $EE^T = 1$ (using
the fundamental fact that the left and right inverses of any square matrix are the
same) which shows that the rows of $E$ are also orthonormal. The vectors $\hat{e}^{(a)}$ are
called the dual basis to the $e^{(a)}$. This result is sometimes useful in simplifying
expressions: for example $\sum_a e^{(a)}_i e^{(a)}_j \Lambda(i,j)$, where $\Lambda$ is some function, can be
replaced by $\Lambda(i,i)\delta_{ij}$.
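
The claim is easy to check numerically. The sketch below builds a matrix with orthonormal columns from the QR factorization of a random matrix (a construction chosen here for convenience, not taken from the text) and confirms that its rows are orthonormal too.

```python
import numpy as np

rng = np.random.default_rng(0)
d = 4

# QR factorization of a random square matrix yields E with orthonormal columns.
E, _ = np.linalg.qr(rng.standard_normal((d, d)))

cols_orthonormal = np.allclose(E.T @ E, np.eye(d))  # E^T E = 1
rows_orthonormal = np.allclose(E @ E.T, np.eye(d))  # E E^T = 1
print(cols_orthonormal, rows_orthonormal)
```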



3.2 Other Ways to Think About Matrix Multiplication 

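Suppose you have matrices $X \in M_{mn}$ and $Y \in M_{np}$ so that $XY \in M_{mp}$. The
familiar way to represent matrix multiplication is $(XY)_{ab} = \sum_{i=1}^n X_{ai}Y_{ib}$, where
the summands are just products of numbers. However an alternative representation
is $XY = \sum_{i=1}^n x_i y_i'$, where $x_i$ ($y_i'$) is the $i$’th column (row) of $X$ ($Y$), and
where the summands are outer products of matrices. For example, we can write
the product of a 2 x 3 and a 3 x 2 matrix as

$$\begin{pmatrix} a & b & c \\ d & e & f \end{pmatrix}\begin{pmatrix} g & h \\ i & j \\ k & l \end{pmatrix} = \begin{pmatrix} a \\ d \end{pmatrix}\begin{pmatrix} g & h \end{pmatrix} + \begin{pmatrix} b \\ e \end{pmatrix}\begin{pmatrix} i & j \end{pmatrix} + \begin{pmatrix} c \\ f \end{pmatrix}\begin{pmatrix} k & l \end{pmatrix}$$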



One immediate consequence (which we’ll use in our description of singular 
value decomposition below) is that you can always add columns at the right 
of X, and rows at the bottom of Y, and get the same product XY , provided 
either the extra columns, or the extra rows, contain only zeros. To see why this 
expansion works it’s helpful to expand the outer products into standard matrix 
form: the matrix multiplication is just 



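$$\left\{\begin{pmatrix} a & 0 & 0 \\ d & 0 & 0 \end{pmatrix} + \begin{pmatrix} 0 & b & 0 \\ 0 & e & 0 \end{pmatrix} + \begin{pmatrix} 0 & 0 & c \\ 0 & 0 & f \end{pmatrix}\right\} \times \left\{\begin{pmatrix} g & h \\ 0 & 0 \\ 0 & 0 \end{pmatrix} + \begin{pmatrix} 0 & 0 \\ i & j \\ 0 & 0 \end{pmatrix} + \begin{pmatrix} 0 & 0 \\ 0 & 0 \\ k & l \end{pmatrix}\right\}$$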





Along a similar vein, the usual way to view matrix-vector multiplication is
as an operation that maps a vector $z \in \mathbb{R}^n$ to another vector $z' \in \mathbb{R}^m$: $z' = Az$.
However you can also view the product as a linear combination of the columns
of $A$: $z' = \sum_{i=1}^n z_i A_{|i}$, where $A_{|i}$ denotes the $i$’th column of $A$. With this view it’s easy to see why the result must lie in
the span of the columns of $A$.
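
The three views of multiplication described in this section can be checked in a few lines. The matrices below are arbitrary illustrative values.

```python
import numpy as np

X = np.array([[1.0, 2.0, 3.0],
              [4.0, 5.0, 6.0]])           # 2 x 3
Y = np.array([[7.0, 8.0],
              [9.0, 10.0],
              [11.0, 12.0]])              # 3 x 2

# XY as a sum of outer products: (column i of X) times (row i of Y).
outer_sum = sum(np.outer(X[:, i], Y[i, :]) for i in range(X.shape[1]))

# Padding X with zero columns lets us append arbitrary rows to Y
# without changing the product.
X_pad = np.hstack([X, np.zeros((2, 2))])
Y_pad = np.vstack([Y, np.ones((2, 2))])

# A matrix-vector product is a linear combination of the columns of X.
z = np.array([1.0, -1.0, 2.0])
column_combo = sum(z[i] * X[:, i] for i in range(3))

print(outer_sum, X_pad @ Y_pad, column_combo)
```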
3.3 The Levi-Civita Symbol

The Levi-Civita symbol® in $d$ dimensions is denoted $\epsilon_{ij\cdots k}$ and takes the value
1 if its $d$ indices are an even permutation of $1, 2, 3, \cdots, d$, the value -1 if an odd
permutation, and 0 otherwise. The 3-dimensional version of this is the fastest
way I know to derive vector identities in three dimensions, using the identity
$\epsilon_{ijk}\epsilon_{imn} = \delta_{jm}\delta_{kn} - \delta_{jn}\delta_{km}$ (recall that repeated indices are summed).
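
The symbol and the contraction identity can be verified by brute force. In this sketch `levi_civita` is a hypothetical helper that fills the tensor by counting inversions, and `numpy.einsum` performs the sums over repeated indices.

```python
import itertools
import numpy as np

def levi_civita(d):
    """Return the d-index Levi-Civita symbol as a dense array."""
    eps = np.zeros((d,) * d)
    for perm in itertools.permutations(range(d)):
        # Parity of the permutation = parity of its number of inversions.
        inversions = sum(1 for a in range(d) for b in range(a + 1, d)
                         if perm[a] > perm[b])
        eps[perm] = -1.0 if inversions % 2 else 1.0
    return eps

eps = levi_civita(3)
delta = np.eye(3)

# eps_ijk eps_imn = delta_jm delta_kn - delta_jn delta_km
lhs = np.einsum('ijk,imn->jkmn', eps, eps)
rhs = (np.einsum('jm,kn->jkmn', delta, delta)
       - np.einsum('jn,km->jkmn', delta, delta))

# Cross product in component form: a_i = eps_ijk b_j c_k
b = np.array([1.0, 2.0, 3.0])
c = np.array([-1.0, 0.0, 4.0])
a = np.einsum('ijk,j,k->i', eps, b, c)
print(np.allclose(lhs, rhs), a)
```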

Exercise 5. $a = b \wedge c$: $a_i = \epsilon_{ijk}b_j c_k$.

$$(a \wedge b)\cdot(c \wedge d) = (a\cdot c)(b\cdot d) - (a\cdot d)(b\cdot c)$$



3.4 Characterizing the Determinant and Inverse 

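The determinant of a matrix $A \in M_n$ can be defined as

$$|A| = \frac{1}{n!}\,\epsilon_{\alpha_1\alpha_2\cdots\alpha_n}\epsilon_{\beta_1\beta_2\cdots\beta_n}A_{\alpha_1\beta_1}A_{\alpha_2\beta_2}\cdots A_{\alpha_n\beta_n} \qquad (11)$$

Exercise 6.

$$|A| = \epsilon_{\alpha_1\alpha_2\cdots\alpha_n}A_{1\alpha_1}A_{2\alpha_2}\cdots A_{n\alpha_n} \qquad (12)$$

We can use this to prove an interesting theorem linking the determinant,
derivatives, and the inverse:

Lemma 1. For any square nonsingular matrix $A$,

$$\frac{\partial |A|}{\partial A_{ij}} = |A|\,A^{-1}_{ji} \qquad (13)$$

Proof.

$$\frac{\partial |A|}{\partial A_{ij}} = \epsilon_{j\alpha_2\cdots\alpha_n}\delta_{i1}A_{2\alpha_2}\cdots A_{n\alpha_n} + \epsilon_{\alpha_1 j\cdots\alpha_n}A_{1\alpha_1}\delta_{i2}A_{3\alpha_3}\cdots A_{n\alpha_n} + \cdots$$

so

$$A_{kj}\frac{\partial |A|}{\partial A_{ij}} = \epsilon_{\alpha_1\alpha_2\cdots\alpha_n}\left(A_{k\alpha_1}\delta_{i1}A_{2\alpha_2}\cdots + A_{1\alpha_1}A_{k\alpha_2}\delta_{i2}A_{3\alpha_3}\cdots + \cdots\right)$$

For any value of $i$, one and only one term in the sum on the right survives,
and for that term, we must have $k = i$ by antisymmetry of the $\epsilon$. Thus the right
hand side is just $|A|\delta_{ki}$. Multiplying both sides on the right by $(A^T)^{-1}$ gives the
result. □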
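
A finite-difference check of the lemma is sketched below. The test matrix, the diagonal shift that keeps it well away from singular, and the step size `h` are all illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 4
# Random matrix shifted along the diagonal so that, for this seed,
# it is comfortably nonsingular.
A = rng.standard_normal((n, n)) + n * np.eye(n)

# Lemma: d|A| / dA_ij = |A| * (A^{-1})_ji, i.e. |A| times the transposed inverse.
analytic = np.linalg.det(A) * np.linalg.inv(A).T

# Central finite differences, one matrix entry at a time.
numeric = np.zeros_like(A)
h = 1e-6
for i in range(n):
    for j in range(n):
        dA = np.zeros_like(A)
        dA[i, j] = h
        numeric[i, j] = (np.linalg.det(A + dA) - np.linalg.det(A - dA)) / (2 * h)

print(np.max(np.abs(analytic - numeric)))
```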



® The name ‘tensor’ is sometimes incorrectly applied to arbitrary objects with more
than one index. In fact a tensor is a generalization of the notion of a vector and is
a geometrical object (has meaning independent of the choice of coordinate system);
$\epsilon$ is a pseudo-tensor (transforms as a tensor, but changes sign upon inversion).
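We can also use this to write the following closed form for the inverse:

$$A^{-1}_{ij} = \frac{1}{|A|(n-1)!}\,\epsilon_{j\alpha_1\alpha_2\cdots\alpha_{n-1}}\epsilon_{i\beta_1\beta_2\cdots\beta_{n-1}}A_{\alpha_1\beta_1}A_{\alpha_2\beta_2}\cdots A_{\alpha_{n-1}\beta_{n-1}}$$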






Exercise 7. 



Exercise 8.

$$\frac{\partial A^{-1}_{ij}}{\partial A_{\alpha\beta}} = -A^{-1}_{i\alpha}A^{-1}_{\beta j}$$

($A^{-1}A = 1$)



Exercise 9. p(x) 

l^l^^/^exp (-i(x - /x)'if“^(x - /x)) n. 

p(xi,X2 ,--- ,x„|/x, r) = 

- s 



( 14 ) 
Puzzle 5:



, , n = 2, xi = — X 2 , 

4; 

X , 






3.5 SVD in Seven Steps 

Singular value decomposition is a generalization of eigenvalue decomposition. 
While eigenvalue decomposition applies only to square matrices, SVD applies to 
rectangular; and while not all square matrices are diagonalizable, every matrix 
has an SVD. SVD is perhaps less familiar, but it plays important roles in every- 
thing from theorem proving to algorithm design (for example, for a classic result 
on applying SVD to document categorization, see [10]). The key observation is 
that, given $A \in M_{mn}$, although we cannot perform an eigendecomposition of $A$,
we can do so for the two matrices $AA^T \in S_m$ and $A^TA \in S_n$. Since both of
these are positive semidefinite, their eigenvalues are non-negative; if $AA^T$ has
rank $k$, define the ‘singular values’ $\sigma_i$ by taking $\sigma_i^2$ to be its $k$ positive eigenvalues. Below we
will use ‘nonzero eigenvector’ to mean an eigenvector with nonzero eigenvalue,
will denote the diagonal matrix whose $i$’th diagonal component is $\sigma_i$ by diag($\sigma_i$),
and will assume without loss of generality that $m \le n$. Note that we repeatedly
use the tricks mentioned in Section 3.2. Let’s derive the SVD.

1. $AA^T$ and $A^TA$ share their nonzero eigenvalues $\sigma_i^2$. Let $x_i \in \mathbb{R}^m$ be an eigenvector
of $AA^T$ with positive eigenvalue $\sigma_i^2$, and let $y_i \equiv (1/\sigma_i)(A^Tx_i)$, $y_i \in \mathbb{R}^n$.
Then $A^TAy_i = (1/\sigma_i)A^TAA^Tx_i = \sigma_iA^Tx_i = \sigma_i^2y_i$. Similarly let $y_i \in \mathbb{R}^n$ be
an eigenvector of $A^TA$ with eigenvalue $\sigma_i^2$, and let $z_i \equiv (1/\sigma_i)(Ay_i)$. Then
$AA^Tz_i = (1/\sigma_i)AA^TAy_i = \sigma_iAy_i = \sigma_i^2z_i$. Thus there is a 1-1 correspondence
between nonzero eigenvectors for the matrices $A^TA$ and $AA^T$, and the
corresponding eigenvalues are shared.
2. The $x_i$ and $y_i$ are orthonormal. The $x_i$ are orthonormal, or can be so chosen,
since they are eigenvectors of a symmetric matrix. Then
$y_i \cdot y_j \propto x_i'AA^Tx_j \propto x_i \cdot x_j \propto \delta_{ij}$.

3. rank($A$) = rank($A^T$) = rank($AA^T$) = rank($A^TA$) $\equiv k$ [12].

4. Let the $x_i$ be the nonzero eigenvectors of $AA^T$ and the $y_i$ those of $A^TA$. Let
$X \in M_{mk}$ ($Y \in M_{nk}$) be the matrix whose columns are the $x_i$ ($y_i$). Then
$Y = A^TX\,\mathrm{diag}(1/\sigma_i) \Rightarrow \mathrm{diag}(\sigma_i)Y^T = X^TA$. Note that $m \ge k$; if $m = k$,
then $A = X\,\mathrm{diag}(\sigma_i)Y^T$.

5. If $m > k$, add $m - k$ rows of orthonormal null vectors of $A^T$ to the bottom of
$X^T$, and add $m - k$ zero rows to the bottom of diag($\sigma_i$); defining the latter
to be diag($\sigma_i$, 0), then $X$ is orthogonal and $A = X\,\mathrm{diag}(\sigma_i, 0)Y^T$. Note that
here, $X \in M_m$, diag($\sigma_i$, 0) $\in M_{mk}$ and $Y \in M_{nk}$.

6. To get something that looks more like an eigendecomposition, add $n - k$
rows of vectors that, together with the $y_i$, form an orthonormal set, to the
bottom of $Y^T$, and add $n - k$ columns of zeros to the right of diag($\sigma_i$, 0);
defining the latter to be diag($\sigma_i$, 0, 0), then the $Y$ are also orthogonal and
$A = X\,\mathrm{diag}(\sigma_i, 0, 0)Y^T$. Note that here, $X \in M_m$, diag($\sigma_i$, 0, 0) $\in M_{mn}$, and
$Y \in M_n$.

7. To get something that looks more like a sum of outer products, just write $A$
in step (4) as $A = \sum_{i=1}^k \sigma_i x_i y_i'$.
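
The steps above can be followed numerically for the full-rank case $m = k$. This is a sketch under that assumption only (a random 3 x 5 matrix has rank 3 with probability one); it does not handle the zero-padding of steps 5 and 6.

```python
import numpy as np

rng = np.random.default_rng(2)
m, n = 3, 5
A = rng.standard_normal((m, n))  # full rank k = m for this draw

# Steps 1-2: eigenvectors x_i of A A^T; the eigenvalues are sigma_i^2.
evals, X = np.linalg.eigh(A @ A.T)
order = np.argsort(evals)[::-1]          # sort by decreasing eigenvalue
evals, X = evals[order], X[:, order]
sigma = np.sqrt(evals)

# Step 4: y_i = (1 / sigma_i) A^T x_i, stacked as the columns of Y.
Y = (A.T @ X) / sigma                    # divides column i by sigma_i

# With m = k, A = X diag(sigma) Y^T.
A_rebuilt = X @ np.diag(sigma) @ Y.T

# Step 7: the same thing as a sum of outer products.
A_outer = sum(s * np.outer(x, y) for s, x, y in zip(sigma, X.T, Y.T))

print(np.allclose(A_rebuilt, A), sigma)
```

The recovered `sigma` can be compared against `np.linalg.svd(A, compute_uv=False)`, which returns the singular values in the same decreasing order.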

Let’s put the singular value decomposition to work. 



3.6 The Moore-Penrose Generalized Inverse

Suppose $B \in S_m$ has eigendecomposition $B = E\Lambda E^T$, where $\Lambda$ is diagonal and $E$
is the orthogonal matrix of column eigenvectors. Suppose further that $B$ is nonsingular,
so that $B^{-1} = E\Lambda^{-1}E^T = \sum_i (1/\lambda_i)e_ie_i'$. This suggests that, since SVD
generalizes eigendecomposition, perhaps we can also use SVD to generalize the
notion of matrix inverse to non-square matrices $A \in M_{mn}$. The Moore-Penrose
generalized inverse (often called just the generalized inverse) does exactly this^.
In outer product form, it’s the SVD analog of the ordinary inverse, with the latter
written in terms of outer products of eigenvectors: $A^\dagger = \sum_{i=1}^k (1/\sigma_i)y_ix_i' \in M_{nm}$.

The generalized inverse has several special properties:

1. $AA^\dagger$ and $A^\dagger A$ are Hermitian (symmetric, for real matrices);

2. $AA^\dagger A = A$;

3. $A^\dagger AA^\dagger = A^\dagger$.

In fact, $A^\dagger$ is uniquely determined by conditions (1), (2) and (3). Also, if $A$ is
square and nonsingular, then $A^\dagger = A^{-1}$, and more generally, if $(A^TA)^{-1}$ exists,
then $A^\dagger = (A^TA)^{-1}A^T$, and if $(AA^T)^{-1}$ exists, then $A^\dagger = A^T(AA^T)^{-1}$. The
generalized inverse comes in handy, for example, in characterizing the general 
solution to linear equations, as we’ll now see. 
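
The properties listed above can be checked against NumPy's `numpy.linalg.pinv`. The sketch below also rebuilds the generalized inverse from the outer-product form, assuming a tall random matrix with full column rank; in NumPy's convention $A = U\,\mathrm{diag}(s)\,V^T$, so the text's $x_i$ are the columns of `U` and the $y_i$ are the rows of `Vt`.

```python
import numpy as np

rng = np.random.default_rng(3)
A = rng.standard_normal((5, 3))          # tall, full column rank for this draw
A_pinv = np.linalg.pinv(A)

# Outer-product form: sum_i (1 / sigma_i) y_i x_i'
U, s, Vt = np.linalg.svd(A, full_matrices=False)
A_pinv_svd = sum((1.0 / si) * np.outer(Vt[i], U[:, i]) for i, si in enumerate(s))

# The three defining conditions.
cond1 = (np.allclose(A @ A_pinv, (A @ A_pinv).T)
         and np.allclose(A_pinv @ A, (A_pinv @ A).T))
cond2 = np.allclose(A @ A_pinv @ A, A)
cond3 = np.allclose(A_pinv @ A @ A_pinv, A_pinv)

# Because (A^T A)^{-1} exists here, the closed form applies as well.
closed_form = np.linalg.inv(A.T @ A) @ A.T
print(cond1, cond2, cond3)
```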



The Moore-Penrose generalized inverse is one of many pseudo inverses. 
3.7 SVD, Linear Maps, Range and Null Space

If $A \in M_{mn}$, the range of $A$, $\mathcal{R}(A)$, is defined as that subspace spanned by
$y = Ax$ for all $x \in \mathbb{R}^n$. $A$’s null space $\mathcal{N}(A)$, on the other hand, is that subspace
spanned by those $x \in \mathbb{R}^n$ for which $Ax = 0$. Letting $A_{|i}$ denote the columns
of $A$, recall that $Ax = x_1A_{|1} + x_2A_{|2} + \cdots + x_nA_{|n}$, so that the dimension
of $\mathcal{R}(A)$ is the rank $k$ of $A$, and $\mathcal{R}(A)$ is spanned by the columns of $A$. Also,
$\mathcal{N}(A^T)$ is spanned by those vectors which are orthogonal to every row of $A^T$
(or every column of $A$), so $\mathcal{R}(A)$ is the orthogonal complement of $\mathcal{N}(A^T)$. The
notions of range and null space are simply expressed in terms of the SVD,
$A = \sum_{i=1}^k \sigma_ix_iy_i'$, $x_i \in \mathbb{R}^m$, $y_i \in \mathbb{R}^n$. The null space of $A$ is the subspace orthogonal
to the $k$ $y_i$, so $\dim(\mathcal{N}(A)) = n - k$. The range of $A$ is spanned by the $x_i$, so
$\dim(\mathcal{R}(A)) = k$. Thus in particular, we have $\dim(\mathcal{R}(A)) + \dim(\mathcal{N}(A)) = n$.

The SVD provides a handy way to characterize the solutions to linear systems
of equations. In general the system $Az = b$, $A \in M_{mn}$, $z \in \mathbb{R}^n$, $b \in \mathbb{R}^m$ has 0, 1
or $\infty$ solutions (if $z_1$ and $z_2$ are solutions, then so is $\alpha z_1 + \beta z_2$ for any $\alpha, \beta \in \mathbb{R}$ with $\alpha + \beta = 1$). When
does a solution exist? Since $Az$ is a linear combination of the columns of $A$, $b$
must lie in the span of those columns. In fact, if $b \in \mathcal{R}(A)$, then $z_0 = A^\dagger b$ is
a solution, since $Az_0 = \sum_{i=1}^k \sigma_ix_iy_i' \sum_{j=1}^k (1/\sigma_j)y_jx_j'b = \sum_{i=1}^k x_ix_i'b = b$, and
the general solution is therefore $z = A^\dagger b + \mathcal{N}(A)$.

Puzzle 6: ^ ^ b ^ TZ{A) 

What if $b \notin \mathcal{R}(A)$, i.e. $Az = b$ has no solution? One reasonable step would
be to find that $z$ that minimizes the Euclidean norm $\|Az - b\|$. However, adding
any vector in $\mathcal{N}(A)$ to a solution $z$ would also give a solution, so a reasonable
second step is to require in addition that $\|z\|$ is minimized. The general solution
to this is again $z = A^\dagger b$. This is closely related to the following unconstrained
quadratic programming problem: minimize $f(z) = \frac{1}{2}z'Az + b'z$, $z \in \mathbb{R}^n$, $A \succeq 0$.
(We need the extra condition on $A$ since otherwise $f$ can be made arbitrarily
negative). The solution to this is at $\nabla f = 0 \Rightarrow Az + b = 0$, so the general
solution is $z = -A^\dagger b + \mathcal{N}(A)$.
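
The minimum-norm least-squares property can be illustrated with a deliberately rank-deficient system. The matrices and the explicit cutoff `rcond=1e-10` are assumptions for this example; the cutoff simply makes both routines treat the numerically zero singular values as zero.

```python
import numpy as np

rng = np.random.default_rng(4)
# A 4 x 5 matrix of rank 2, so the null space of A has dimension 3.
A = rng.standard_normal((4, 2)) @ rng.standard_normal((2, 5))
b = rng.standard_normal(4)               # generally not in the range of A

z_pinv = np.linalg.pinv(A, rcond=1e-10) @ b
z_lstsq, *_ = np.linalg.lstsq(A, b, rcond=1e-10)

# The last right singular vector belongs to a zero singular value,
# so it lies in the null space of A.
_, s, Vt = np.linalg.svd(A)
null_vec = Vt[-1]
z_other = z_pinv + null_vec

res_pinv = np.linalg.norm(A @ z_pinv - b)
res_other = np.linalg.norm(A @ z_other - b)
print(res_pinv, res_other, np.linalg.norm(z_pinv), np.linalg.norm(z_other))
```

Both candidates give the same residual, but the pseudo-inverse solution has the smaller norm, which is what singles it out.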

Puzzle 7: b ^ TZ{A) , . , A ^ 0 

/ ^ 



3.8 Matrix Norms 

A function $\|\cdot\| : M_{mn} \to \mathbb{R}$ is a matrix norm over a field $F$ if for all $A, B \in M_{mn}$,

1. $\|A\| \ge 0$

2. $\|A\| = 0 \Leftrightarrow A = 0$

3. $\|cA\| = |c|\,\|A\|$ for all scalars $c \in F$

4. $\|A + B\| \le \|A\| + \|B\|$

The Frobenius norm, $\|A\|_F = \sqrt{\sum_{ij}|A_{ij}|^2}$, is often used to represent the
distance between matrices $A$ and $B$ as $\|A - B\|_F$, when for example one is
searching for that matrix which is as close as possible to a given matrix, given
some constraints. For example, the closest positive semidefinite matrix, in Frobenius
norm, to a given symmetric matrix $A$, is $\hat{A} \equiv \sum_{i:\lambda_i>0}\lambda_ie^{(i)}e^{(i)\prime}$ where the
$\lambda_i$, $e^{(i)}$ are the eigenvalues and eigenvectors of $A$, respectively. The Minkowski
vector p-norm also has a matrix analog: $\|A\|_p \equiv \max_{\|x\|_p=1}\|Ax\|_p$. There are
three interesting special cases of this which are easy to compute: the maximum
absolute column norm, $\|A\|_1 \equiv \max_j \sum_i |A_{ij}|$, the maximum absolute row norm,
$\|A\|_\infty \equiv \max_i \sum_j |A_{ij}|$, and the spectral norm, $\|A\|_2$. Both the Frobenius and
spectral norms can be written in terms of the singular values: assuming the
ordering $\sigma_1 \ge \sigma_2 \cdots \ge \sigma_k$, then $\|A\|_2 = \sigma_1$ and $\|A\|_F = \sqrt{\sum_{i=1}^k \sigma_i^2}$.
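
These special cases map directly onto `numpy.linalg.norm`. The short sketch below computes each norm by hand on an arbitrary random matrix and compares it with the library value.

```python
import numpy as np

rng = np.random.default_rng(6)
A = rng.standard_normal((4, 6))
s = np.linalg.svd(A, compute_uv=False)   # singular values, largest first

col_norm = np.abs(A).sum(axis=0).max()   # maximum absolute column sum
row_norm = np.abs(A).sum(axis=1).max()   # maximum absolute row sum
spectral = s[0]                          # largest singular value
frobenius = np.sqrt((s ** 2).sum())      # root of the sum of squared singular values

print(col_norm, row_norm, spectral, frobenius)
```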

Exercise IQ. U W , \\UAW\\f = 

UWf 

Exercise 11. submultiplicative property $\|AB\| \le \|A\|\,\|B\|$.

AeMm 

... , ||A|| < F , (1 + Al)-1 = 

1- A + A"^ - A^ -\ , A 

, ||A-i||<l. (l+A)-i = A-i(1-A-i+A-2-A-3+...) 

W. W{1 + W'W)-^W = {1 + WW')-^WW = 
VF1F'(1 + VFlF')”i (. , 

The Minkowski $p$ norm has the important property that $\|Ax\|_p \le \|A\|_p\|x\|_p$.
Let’s use this, and the $L_1$ and $L_\infty$ matrix norms, to prove a basic fact about
stochastic matrices. A matrix $P$ is stochastic if its elements can be interpreted
as probabilities, that is, if all elements are real and non-negative, and each row
sums to one (row-stochastic), or each column sums to one (column-stochastic),
or both (doubly stochastic).

Theorem 1. If $P$ is a square stochastic matrix, then the eigenvalues of $P$ have
absolute values in the range [0,1].

Proof. For any $p \ge 1$, and $x$ any eigenvector of $P$ with eigenvalue $\lambda$, $\|Px\|_p = |\lambda|\,\|x\|_p \le \|P\|_p\|x\|_p$
so $|\lambda| \le \|P\|_p$. Suppose that $P$ is row-stochastic; then choose the $L_\infty$ norm, which
is the maximum absolute row norm $\|P\|_\infty = \max_i \sum_j |P_{ij}| = 1$; so $|\lambda| \le 1$. If
$P$ is column-stochastic, choosing the 1-norm (the maximum absolute column
norm) gives the same result. □
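
The bound is easy to observe on a random row-stochastic matrix. The construction below (normalising the rows of a random non-negative matrix) is an illustrative assumption.

```python
import numpy as np

rng = np.random.default_rng(5)
P = rng.random((5, 5))
P /= P.sum(axis=1, keepdims=True)        # normalise so every row sums to one

eigvals = np.linalg.eigvals(P)           # may be complex when P is not symmetric
inf_norm = np.abs(P).sum(axis=1).max()   # maximum absolute row norm
largest_modulus = np.abs(eigvals).max()

print(inf_norm, largest_modulus)
```

The vector of ones is an eigenvector of any row-stochastic matrix with eigenvalue 1, so the bound is attained.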

Note that stochastic matrices, if not symmetric, can have complex eigenvalues,
so in this case $F$ is the field of complex numbers.

3.9 Positive Semidefinite Matrices 

Positive semidefinite matrices are ubiquitous in machine learning theory and 
algorithms (for example, every kernel matrix is positive semidefinite, for Mercer 



11 



Some authors include this in the definition of matrix norm [12]. 
kernels). Again we restrict ourselves to real matrices. A matrix $A \in S_n$ is positive
definite iff for every nonzero $x \in \mathbb{R}^n$, $x'Ax > 0$; it is positive semidefinite iff for every
$x \in \mathbb{R}^n$, $x'Ax \ge 0$, and some nonzero $x$ exists for which the equality is met. Recall
that we denote the property of positive definiteness of a matrix $A$ by $A \succ 0$,
and positive semidefiniteness by $A \succeq 0$. Let’s start by listing a few properties,
the first of which relate to what positive semidefinite matrices look like (here,
repeated indices are not summed):

1. If $A \succ 0$, then $A_{ii} > 0 \;\forall i$;

2. If $A \succeq 0$, then $A_{ii} \ge 0 \;\forall i$;

3. If $A \succeq 0$, then $A_{ii} = 1 \;\forall i \Rightarrow |A_{ij}| \le 1 \;\forall i, j$;

4. If $A \in S_n$ is strictly diagonally dominant, that is, $A_{ii} > \sum_{j \ne i}|A_{ij}| \;\forall i$, then
it is also positive definite;

5. If $A \succeq 0$ and $A_{ii} = 0$ for some $i$, then $A_{ij} = A_{ji} = 0 \;\forall j$;

6. If $A \succeq 0$ then $A_{ii}A_{jj} \ge |A_{ij}|^2 \;\forall i, j$;

7. If A G ^ 0 and i? G ^ 0 then AB y 0; 

8. $A \in S_n$ is positive semidefinite and of rank one iff $A = xx'$ for some nonzero $x \in \mathbb{R}^n$;

9. $A \succ 0 \Leftrightarrow$ all of the leading minors of $A$ are positive.

A very useful way to think of positive semidefinite matrices is in terms of
Gram matrices. Let $V$ be a vector space over some field $F$, with inner product
$\langle\cdot,\cdot\rangle$. The Gram matrix $G$ of a set of vectors $v_i \in V$ is defined by $G_{ij} \equiv \langle v_i, v_j\rangle$.

Now let $V$ be Euclidean space and let $F$ be the reals. The key result is the
following: let $A \in S_n$. Then $A$ is positive semidefinite with rank $r$ if and only if
there exists a set of vectors $\{v_1, \dots, v_n\}$, $v_i \in V$, containing exactly $r$ linearly
independent vectors, such that $A_{ij} = v_i \cdot v_j$.

Note in particular that the vectors $v$ can always be chosen to have dimension
$r \le n$.

Puzzle 8 : kernel matrix TV G S'„ ^ 

Kij = k{xi,Xj) ^i,^j G i,j = l,...,zz <i, k 

k, . H 

fc(xi,Xj) = ^(xj) -^(xj) . . H 

( k{x„xj) = exp-(i/'^Al|xi-xdl") 

n 



Some properties of positive semidefinite matrices that might otherwise seem 
mysterious become obvious, when they are viewed as Gram matrices, as I hope 
the following exercise helps demonstrate. 



Exercise 12. 

()-()-()- ( ) 

( ) ().()() () 



If the Gram representation is so useful, the question naturally arises: given
a positive semidefinite matrix, how can you extract a set of Gram vectors for
it? (Note that the set of Gram vectors is never unique; for example, globally
rotating them gives the same matrix). Let $A \in S_n \succeq 0$ and write the eigendecomposition
of $A$ in outer product form: $A = \sum_{a=1}^n \lambda_ae^{(a)}e^{(a)\prime}$, or $A_{ij} =
\sum_{a=1}^n \lambda_ae^{(a)}_ie^{(a)}_j$. Written in terms of the dual eigenvectors $\hat{e}^{(i)}_a \equiv e^{(a)}_i$ (see Section 3.1):
$A_{ij} = \sum_{a=1}^n \lambda_a\hat{e}^{(i)}_a\hat{e}^{(j)}_a$, the summand has become a weighted dot product; we
can therefore take the set of Gram vectors to be $v^{(i)}_a = \sqrt{\lambda_a}\,\hat{e}^{(i)}_a$. The Gram
vectors therefore are the dual basis to the scaled eigenvectors.

3.10 Distance Matrices 

One well-known use of the Gram vector decomposition of positive semidefinite
matrices is the following. Define a ‘distance matrix’ to be any matrix of the
form $D_{ij} \in S_n \equiv \|x_i - x_j\|^2$, where $\|\cdot\|$ is the Euclidean norm (note that
the entries are actually squared distances). A central goal of multidimensional
scaling is the following: given a matrix which is a distance matrix, or which is
approximately a distance matrix, or which can be mapped to an approximate
distance matrix, find the underlying vectors $x_i \in \mathbb{R}^d$, where $d$ is chosen to be
as small as possible, given the constraint that the distance matrix reconstructed
from the $x_i$ approximates $D$ with acceptable accuracy [8]. $d$ is chosen to be small
essentially to remove unimportant variance from the problem (or, if sufficiently
small, for data visualization). Now let $e$ be the column vector of $n$ ones, and
introduce the ‘centering’ projection matrix $P^e \equiv 1 - \frac{1}{n}ee'$.
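
A minimal sketch of how the centering matrix is used in classical multidimensional scaling follows. The factor of $-\frac{1}{2}$ and the eigenvector-based recovery are the standard textbook recipe and are assumptions here, since they do not appear explicitly in the passage above; the points are synthetic.

```python
import numpy as np

rng = np.random.default_rng(8)
n, d = 6, 2
pts = rng.standard_normal((n, d))

# Matrix of squared Euclidean distances between all pairs of points.
D = ((pts[:, None, :] - pts[None, :, :]) ** 2).sum(axis=-1)

e = np.ones((n, 1))
P = np.eye(n) - (e @ e.T) / n            # centering projection matrix

# Double centering turns squared distances into the Gram matrix
# of the mean-centred points.
B = -0.5 * P @ D @ P
centred = pts - pts.mean(axis=0)

# Gram vectors from the top d eigenpairs give coordinates that
# reproduce D up to a rotation and translation.
lam, E = np.linalg.eigh(B)
top = np.argsort(lam)[::-1][:d]
X_rec = E[:, top] * np.sqrt(lam[top])
D_rec = ((X_rec[:, None, :] - X_rec[None, :, :]) ** 2).sum(axis=-1)

print(np.allclose(D_rec, D))
```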

Exercise 13. f J x G 7Z" P®x 

X ^ ( ) = 0 f j ® 

P^^ () 

Aij G Sm = ij = ■ ,m, Xi G TZ^, . {P''AP^)ij = {x^-fi)-{xj-fi), 

^ At Xi 

The earliest form of the following theorem is due to Schoenberg [18]. For a 
proof of this version, see [7]. 

Theorem 2. 

0 A,, = Q \/i,j . A = -P<^AP^ 

A , , 7Z‘^ 

A, 



A G Sn Aij > 

d _ A. 

A, 

1 

vA 



3.11 Computing the Inverse of an Enlarged Matrix 

We end our excursion with a look at a trick for efficiently computing inverses. 
Suppose you have a symmetric matrix $K \in S_{n-1}$, and suppose you form a new
symmetric matrix by adding a number $u \equiv K_{nn}$ and a column $v$, $v_i \equiv K_{in}$ (and
a corresponding row $K_{ni} \equiv K_{in}$). Denote the enlarged matrix by

$$K_+ = \begin{pmatrix} K & v \\ v' & u \end{pmatrix} \qquad (15)$$
Now consider the inverse

$$K_+^{-1} \equiv \begin{pmatrix} A & b \\ b' & c \end{pmatrix} \qquad (16)$$

where again $b$ is a column vector and $c$ is a scalar. It turns out that it is straightforward
to compute $A$, $b$ and $c$ in terms of $K^{-1}$, $v$ and $u$. Why is this useful?
In any machine learning algorithm where the dependence on all the data is captured
by a symmetric matrix $K(x_i, x_j)$, then in test phase, when a prediction is
being made for a single point $x$, the dependence on all the data is captured by
$K_+$, where $v_i = K(x_i, x)$ and $u = K(x, x)$. If that algorithm in addition requires
that the quantities $b$ and $c$ be computed, it’s much more efficient to compute
them by using the following simple lemma (and computing $K^{-1}$ just once, for
the training data), rather than by computing $K_+^{-1}$ for each $x$. This is used,
for example, in Gaussian process regression and Gaussian process classification,
where in Gaussian process regression, $c$ is needed to compute the variance in the
estimate of the function value $f(x)$ at the test point $x$, and $b$ and $c$ are needed
to compute the mean of $f(x)$ [9, 20].



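Lemma 2. Given $K \in M_{n-1}$ and $K_+ \in M_n$ as defined above, the elements of
$K_+^{-1}$ are given by:

$$c = \frac{1}{u - v'K^{-1}v} \qquad (17)$$

$$b = -\frac{1}{u - v'K^{-1}v}\,K^{-1}v \qquad (18)$$

$$A_{ij} = K^{-1}_{ij} + \frac{1}{u - v'K^{-1}v}\,(v'K^{-1})_i(v'K^{-1})_j \qquad (19)$$

and furthermore,

$$\frac{\det(K)}{\det(K_+)} = \frac{1}{u - v'K^{-1}v} \qquad (20)$$

Proof. Since the inverse of a symmetric matrix is symmetric, $K_+^{-1}$ can be written
in the form (16). Then requiring that $K_+^{-1}K_+ = 1$ gives (repeated indices are
summed):

$$i < n,\; j < n: \quad A_{im}K_{mj} + b_iv_j = \delta_{ij} \qquad (21)$$

$$i = n,\; j < n: \quad b_mK_{mj} + cv_j = 0 \qquad (22)$$

$$i < n,\; j = n: \quad A_{im}v_m + b_iu = 0 \qquad (23)$$

$$i = n,\; j = n: \quad b_mv_m + cu = 1 \qquad (24)$$

Eq. (22) gives $b' = -cv'K^{-1}$, that is, $b = -cK^{-1}v$ since $K^{-1}$ is symmetric. Substituting this in (24) gives Eq. (17), and
substituting it in (21) gives Eq. (19). Finally the expression for the ratio of
determinants follows from the expression for the elements of an inverse matrix
in terms of ratios of its cofactors. □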
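
The lemma can be checked numerically by assembling the enlarged inverse from its blocks and comparing it with a direct inversion. The positive definite test matrix below is an arbitrary construction chosen so that every inverse exists.

```python
import numpy as np

rng = np.random.default_rng(9)
n = 5
M = rng.standard_normal((n, n))
K_plus = M @ M.T + n * np.eye(n)         # symmetric positive definite

K = K_plus[:-1, :-1]                     # the original (n-1) x (n-1) matrix
v = K_plus[:-1, -1]                      # the added column
u = K_plus[-1, -1]                       # the added diagonal entry

K_inv = np.linalg.inv(K)                 # computed once, reusable for any new v, u
Kinv_v = K_inv @ v

c = 1.0 / (u - v @ Kinv_v)               # Eq. (17)
b = -c * Kinv_v                          # Eq. (18)
A = K_inv + c * np.outer(Kinv_v, Kinv_v) # Eq. (19)

K_plus_inv = np.block([[A, b[:, None]],
                       [b[None, :], np.array([[c]])]])
det_ratio = np.linalg.det(K) / np.linalg.det(K_plus)
print(np.allclose(K_plus_inv, np.linalg.inv(K_plus)), det_ratio, c)
```

Once `K_inv` is available, each new test point costs only a matrix-vector product and an outer product instead of a fresh inversion.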
Exercise 14.



():():(■) ( ) 



K-[- G S2 



Puzzle 9; 



( n = 2) 

AG Sn^ 



( ) 
( 





1 + 6 , 6 1 , 
( . ; 
Abstract. This article gives a basic introduction to the principles of
Bayesian inference in a machine learning context, with an emphasis on
the importance of marginalisation for dealing with uncertainty. We begin
by illustrating concepts via a simple regression task before relating
ideas to practical, contemporary, techniques with a description of ‘sparse
Bayesian’ models and the ‘relevance vector machine’.



1 Introduction 

What is meant by “Bayesian inference” in the context of machine learning? To 
assist in answering that question, let’s start by proposing a conceptual task: we 
wish to learn, from some given number of example instances of them, a model 
of the relationship between pairs of variables A and B. Indeed, many machine 
learning problems are of the type “given A, what is B?”.^ 

Verbalising what we typically treat as a mathematical task raises an interest- 
ing question in itself. How do we answer “what is B?”? Within the appealingly 
well-defined and axiomatic framework of propositional logic, we ‘answer’ the 
question with complete certainty, but this logic is clearly too rigid to cope with 
the realities of real-world modelling, where uncertainty over ‘truth’ is ubiquitous. 
Our measurements of both the dependent (B) and independent (A) variables are 
inherently noisy and inexact, and the relationships between the two are invari- 
ably non-deterministic. This is where probability theory comes to our aid, as it 
furnishes us with a principled and consistent framework for meaningful reasoning 
in the presence of uncertainty. 

We might think of probability theory, and in particular Bayes’ rule, as providing
us with a “logic of uncertainty” [1]. In our example, given A we would
‘reason’ about the likelihood of the truth of B (let’s say B is binary for example)
via its conditional probability $P(B|A)$: that is, “what is the probability of
B given that A takes a particular value?”. An appropriate answer might be “B
is true with probability 0.6”. One of the primary tasks of ‘machine learning’ is



^ In this article we will focus exclusively on such ‘supervised learning’ tasks, although
of course there are other modelling applications which are equally amenable to
Bayesian inferential techniques.
then to approximate $P(B|A)$ with some appropriately specified model based on
a given set of corresponding examples of A and B.^

It is in the modelling procedure where Bayesian inference comes to the fore. 
We typically (though not exclusively) deploy some form of parameterised model 
for our conditional probability: 

$$P(B|A) = f(A; w), \qquad (1)$$

where $w$ denotes a vector of all the ‘adjustable’ parameters in the model. Then,
given a set $\mathcal{D}$ of $N$ examples of our variables, $\mathcal{D} = \{A_n, B_n\}_{n=1}^N$, a conventional
approach would involve the maximisation of some measure of ‘accuracy’ (or
minimisation of some measure of ‘loss’) of our model for $\mathcal{D}$ with respect to the
adjustable parameters. We then can make predictions, given A, for unknown B
by evaluating $f(A; w)$ with parameters $w$ set to their optimal values. Of course,
if our model $f$ is made too complex (perhaps there are many adjustable parameters
$w$) we risk over-specialising to the observed data $\mathcal{D}$, and consequently
realising a poor model of the true underlying distribution $P(B|A)$.

The first key element of the Bayesian inference paradigm is to treat parameters
such as $w$ as random variables, exactly the same as A and B. So
the conditional probability now becomes $P(B|A, w)$, and the dependency of the
probability of B on the parameter settings, as well as A, is made explicit. Rather
than ‘learning’ comprising the optimisation of some quality measure, a distribution
over the parameters $w$ is inferred from Bayes’ rule. We will demonstrate
this concept by means of a simple example regression task in Section 2.

To obtain this ‘posterior’ distribution over w alluded to above, it is necessary 
to specify a ‘prior’ distribution p(w) before we observe the data. This may be 
considered an inconvenience, but Bayesian inference treats all sources of uncer- 
tainty in the modelling process in a unified and consistent manner, and forces 
us to be explicit as regards our assumptions and constraints; this in itself is 
arguably a philosophically appealing feature of the paradigm. 

However, the most attractive facet of a Bayesian approach is the manner 
in which “Ockham’s Razor” is automatically implemented by ‘integrating out’ 
all irrelevant variables. That is, under the Bayesian framework there is an au- 
tomatic preference for simple models that sufficiently explain the data without 
unnecessary complexity. We demonstrate this key feature in Section 3, and in 
particular underline the point that , ... p(w) 

. We show that, in practical terms, the concept of 
Ockham’s Razor enables us to ‘set’ regularisation parameters and ‘select’ models 
without the need for any additional validation procedure. 

The practical disadvantage of the Bayesian approach is that it requires us 
to perform integrations over variables, and many of these computations are 
analytically intractable. As a result, much contemporary research in Bayesian 



^ In many learning methods, this conditional probability approximation is not made
explicit, though such an interpretation may exist. However, one might consider it a
significant limitation if a particular machine learning procedure cannot be expressed
coherently within a probabilistic framework.




Bayesian Inference: Principles and Practice in Machine Learning 



43 



approaches to machine learning relies on, or is directly concerned with, approxi- 
mation techniques. However, we show in Section 4, where we describe the “sparse 
Bayesian” model, that a combination of analytic calculation and straightforward, 
practically efficient, approximation can offer state-of-the-art results. 



2 From Least-Squares to Bayesian Inference

We introduce the methodology of Bayesian inference by considering an example
prediction (regression) problem. Let us assume we are given a very simple data
set (illustrated later within Figure 1) comprising $N = 15$ samples artificially
generated from the function $y = \sin(x)$ with added Gaussian noise of variance
0.2. We will denote the ‘input’ variables in our example by $x_n$, $n = 1 \ldots N$.
For each such $x_n$, there is an associated real-valued ‘target’ $t_n$,
$n = 1 \ldots N$, and from these input-target pairs, we wish to ‘learn’ the
underlying functional mapping.



2.1 Linear Models 

We will model this data with some parameterised function $y(x; \mathbf{w})$, where
$\mathbf{w} = (w_1, w_2, \ldots, w_M)$ is the vector of adjustable model parameters.
Here, we consider linear (strictly, “linear-in-the-parameter”) models, which are a
linearly-weighted sum of $M$ fixed (but potentially non-linear) basis functions
$\phi_m(x)$:

$$ y(x; \mathbf{w}) = \sum_{m=1}^{M} w_m \phi_m(x). \qquad (2) $$

For our purposes here, we make the common choice to utilise Gaussian
data-centred basis functions $\phi_m(x) = \exp\{-(x - x_m)^2 / r^2\}$, which gives
us a ‘radial basis function’ (RBF) type model.
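As an illustration, the data set and the RBF design matrix described above could
be set up along the following lines. This is only a sketch under stated
assumptions: the text does not give the input locations, the basis width or a
random seed, so the uniform grid on [0, 2π], the width r = 1.0, the seed and the
helper names `make_data`, `design_matrix` and `model` are all hypothetical
choices made here.

```python
import numpy as np

def make_data(n_samples=15, noise_var=0.2, seed=0):
    """Noisy samples of sin(x); the input grid and seed are assumptions."""
    rng = np.random.default_rng(seed)
    x = np.linspace(0.0, 2.0 * np.pi, n_samples)
    t = np.sin(x) + rng.normal(0.0, np.sqrt(noise_var), size=n_samples)
    return x, t

def design_matrix(x_eval, centres, r):
    """Phi[n, m] = exp(-(x_n - x_m)^2 / r^2): Gaussian basis centred on the data."""
    diff = x_eval[:, None] - centres[None, :]
    return np.exp(-(diff ** 2) / r ** 2)

x, t = make_data()
Phi = design_matrix(x, x, r=1.0)   # r = 1.0 is an assumed width

def model(x_eval, w):
    """y(x; w) = sum_m w_m * phi_m(x), evaluated at each entry of x_eval."""
    return design_matrix(x_eval, x, r=1.0) @ w
```

Because the basis functions are centred on the data, M = N here and the design
matrix is square and symmetric with ones on its diagonal.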



“Least-Squares” Approximation. Our objective is to find values for $\mathbf{w}$
such that $y(x; \mathbf{w})$ makes good predictions for new data: i.e. it models
the underlying generative function. A classic approach to estimating
$y(x; \mathbf{w})$ is “least-squares”, minimising the error measure:



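$$ E_D(\mathbf{w}) = \frac{1}{2} \sum_{n=1}^{N} \left[ t_n - \sum_{m=1}^{M} w_m \phi_m(x_n) \right]^2. \qquad (3) $$

If $\mathbf{t} = (t_1, \ldots, t_N)^T$ and $\boldsymbol{\Phi}$ is the ‘design matrix’
such that $\Phi_{nm} = \phi_m(x_n)$, then the minimiser of (3) is obtained in
closed-form via linear algebra:

$$ \mathbf{w}_{LS} = (\boldsymbol{\Phi}^T \boldsymbol{\Phi})^{-1} \boldsymbol{\Phi}^T \mathbf{t}. \qquad (4) $$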



However, with $M = 15$ basis functions and only $N = 15$ examples here,
we know that minimisation of squared-error leads to a model which exactly
interpolates the data samples, as shown in Figure 1.
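The interpolation property can be checked numerically. The sketch below is an
illustration rather than the procedure used to produce Figure 1: the input grid,
the seed and the basis width r = 1.0 are assumed values, and `np.linalg.lstsq` is
used in place of literally forming the inverse in (4), since it returns the same
minimiser while avoiding the poorly conditioned product of the design matrix with
its transpose.

```python
import numpy as np

rng = np.random.default_rng(0)
N = 15
x = np.linspace(0.0, 2.0 * np.pi, N)                # assumed input grid
t = np.sin(x) + rng.normal(0.0, np.sqrt(0.2), N)    # noise variance 0.2

r = 1.0                                             # assumed basis width
Phi = np.exp(-((x[:, None] - x[None, :]) ** 2) / r ** 2)

# Equation (4) is w_LS = (Phi^T Phi)^{-1} Phi^T t. lstsq returns the same
# minimiser without forming Phi^T Phi explicitly.
w_ls, *_ = np.linalg.lstsq(Phi, t, rcond=None)

fitted = Phi @ w_ls
max_residual = np.max(np.abs(fitted - t))
print(f"largest training residual: {max_residual:.2e}")
```

With as many basis functions as data points the residual is zero up to round-off,
which is exactly the behaviour the text describes as potential over-fitting.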

Now, we may look at Figure 1 and exclaim “the function on the right is
clearly over-fitting!”. But, without prior knowledge of the ‘truth’, can we really
judge which model is genuinely better? The answer is that we can’t: in a
real-world problem, the data could quite possibly have been generated by a complex
function such as shown on the right. The only way that we can proceed to
meaningfully learn from data such as this is by imposing some a priori prejudice
on the nature of the complexity of functions we expect to elucidate. A common
way of doing this is via ‘regularisation’.

[Figure 1 panels: ‘Ideal fit’ (left), ‘Least-squares RBF fit’ (right)]

Fig. 1. Overfitting? The ‘ideal fit’ is shown on the left, while the least-squares fit using
15 basis functions is shown on the right and perfectly interpolates all the data points



2.2 Complexity Control: Regularisation 

A common, and generally very reasonable, assumption is that we typically expect 
that data is generated from smooth, rather than complex, functions. In a linear 
model framework, smoother functions typically have smaller weight magnitudes, 
so we can penalise complex functions by adding an appropriate penalty term to 
the cost function that we minimise: 



$$ E(\mathbf{w}) = E_D(\mathbf{w}) + \lambda E_W(\mathbf{w}). \qquad (5) $$

A standard choice is the squared-weight penalty,
$E_W(\mathbf{w}) = \frac{1}{2} \sum_{m=1}^{M} w_m^2$, which conveniently gives the
“penalised least-squares” (PLS) estimate for $\mathbf{w}$:

$$ \mathbf{w}_{PLS} = (\boldsymbol{\Phi}^T \boldsymbol{\Phi} + \lambda \mathbf{I})^{-1} \boldsymbol{\Phi}^T \mathbf{t}. \qquad (6) $$

The hyperparameter $\lambda$ balances the trade-off between $E_D(\mathbf{w})$ and
$E_W(\mathbf{w})$, i.e. between how well the function fits the data and how smooth
it is. Given that we can compute the weights directly for a given $\lambda$, the
learning problem is now transformed into one of finding an appropriate value for
that hyperparameter. A very common approach is to assess potential values of
$\lambda$ according to the error calculated on a set of ‘validation’ data (i.e.
data which is not used to estimate $\mathbf{w}$), and examples of fits for
different values of $\lambda$ and their associated validation errors are given in
Figure 2.

In practice, we might evaluate a large number of models with different
hyperparameter values and select the model with lowest validation error, as
demonstrated in Figure 3. We would then hope that this would give us a model which
was close to ‘the truth’. In this artificial case where we know the generative
function, the deviation from ‘truth’ is illustrated in the figure with the
measurement of ‘test error’, the error on noise-free samples of $\sin(x)$. We can
see that the minimum validation error does not quite localise the best test error,
but it is arguably satisfactorily close. We’ll come back to this graph in Section 3
when we look at marginalisation and how Bayesian inference can be exploited in
order to estimate $\lambda$. For now, we look at how this regularisation approach
can be initially reformulated within a Bayesian probabilistic framework.

Fig. 2. Function estimates (solid line) and validation error for three different values
of regularisation hyperparameter $\lambda$ (the true function is shown dashed). The
training data is plotted in black, and the validation set in green (gray)
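The validation procedure just described can be sketched as follows. The code is
illustrative only: the input ranges, random seed, basis width and the grid of
candidate hyperparameter values are assumptions made here, and the helper names
`pls_weights` and `sum_squared_error` are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(1)
noise_sd = np.sqrt(0.2)

# Separate 15-example training and validation sets (input ranges are assumed).
x_train = np.linspace(0.0, 2.0 * np.pi, 15)
t_train = np.sin(x_train) + rng.normal(0.0, noise_sd, 15)
x_val = rng.uniform(0.0, 2.0 * np.pi, 15)
t_val = np.sin(x_val) + rng.normal(0.0, noise_sd, 15)

def design_matrix(x_eval, centres, r=1.0):
    """Gaussian basis functions centred on the training inputs (assumed r)."""
    return np.exp(-((x_eval[:, None] - centres[None, :]) ** 2) / r ** 2)

Phi_train = design_matrix(x_train, x_train)
Phi_val = design_matrix(x_val, x_train)

def pls_weights(Phi, t, lam):
    """Penalised least-squares: w = (Phi^T Phi + lam I)^{-1} Phi^T t."""
    M = Phi.shape[1]
    return np.linalg.solve(Phi.T @ Phi + lam * np.eye(M), Phi.T @ t)

def sum_squared_error(Phi, t, w):
    """Half the sum of squared residuals, as in the least-squares error."""
    return 0.5 * np.sum((t - Phi @ w) ** 2)

lambdas = np.logspace(-6, 2, 50)
train_err = np.empty_like(lambdas)
val_err = np.empty_like(lambdas)
for i, lam in enumerate(lambdas):
    w = pls_weights(Phi_train, t_train, lam)
    train_err[i] = sum_squared_error(Phi_train, t_train, w)
    val_err[i] = sum_squared_error(Phi_val, t_val, w)

best = int(np.argmin(val_err))
print(f"lambda with lowest validation error: {lambdas[best]:.3g}")
```

Training error alone cannot be used for this selection, since it is smallest for
the weakest regularisation; only the held-out error penalises over-fitting.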

2.3 A Probabilistic Regression Framework 

We assume as before that the data is a noisy realisation of an underlying
functional model: $t_n = y(x_n; \mathbf{w}) + \epsilon_n$. Applying least-squares
resulted in us minimising $\sum_n \epsilon_n^2$, but here we first define an
explicit probabilistic model over the noise component $\epsilon_n$, chosen to be a
Gaussian distribution with mean zero and variance $\sigma^2$. That is,
$p(\epsilon_n|\sigma^2) = N(0, \sigma^2)$. Since
$t_n = y(x_n; \mathbf{w}) + \epsilon_n$ it follows that
$p(t_n|x_n, \mathbf{w}, \sigma^2) = N(y(x_n; \mathbf{w}), \sigma^2)$. Assuming that
each example from the data set has been generated independently (an often
realistic assumption, although not always true), the likelihood of all the data is
given by the product:

$$ p(\mathbf{t}|\mathbf{x}, \mathbf{w}, \sigma^2) = \prod_{n=1}^{N} p(t_n|x_n, \mathbf{w}, \sigma^2), \qquad (7) $$

$$ = \prod_{n=1}^{N} (2\pi\sigma^2)^{-1/2} \exp\left[ -\frac{\{t_n - y(x_n; \mathbf{w})\}^2}{2\sigma^2} \right]. \qquad (8) $$

Footnote: Although ‘probability’ and ‘likelihood’ functions may be identical, a
common convention is to refer to “probability” when it is primarily interpreted as
a function of the random variable $\mathbf{t}$, and “likelihood” when interpreted
as a function of the parameters $\mathbf{w}$.

Fig. 3. Plots of error computed on the separate 15-example training and validation
sets, along with ‘test’ error measured on a third noise-free set. The minimum test
and validation errors are marked with a triangle, and the intersection of the best
$\lambda$ computed via validation is shown



Note that, from now on, we will write terms such as
$p(\mathbf{t}|\mathbf{x}, \mathbf{w}, \sigma^2)$ as $p(\mathbf{t}|\mathbf{w}, \sigma^2)$,
since we never seek to model the given input data $\mathbf{x}$. Omitting to include
such conditioning variables is purely for notational convenience (it implies no
further model assumptions) and is common practice.



2.4 Maximum Likelihood and Least-Squares 

The ‘maximum-likelihood’ estimate for $\mathbf{w}$ is that value which maximises
$p(\mathbf{t}|\mathbf{w}, \sigma^2)$. In fact, this is identical to the
‘least-squares’ solution, which we can see by noting that minimising squared-error
is equivalent to minimising the negative logarithm of the likelihood which here is:

$$ -\log p(\mathbf{t}|\mathbf{w}, \sigma^2) = \frac{N}{2} \log(2\pi\sigma^2) + \frac{1}{2\sigma^2} \sum_{n=1}^{N} \{t_n - y(x_n; \mathbf{w})\}^2. \qquad (9) $$

Since the first term on the right in (9) is independent of $\mathbf{w}$, this leaves
only the second term which is proportional to the squared error.



2.5 Specifying a Bayesian Prior 

Of course, giving an identical solution for $\mathbf{w}$ as least-squares, maximum
likelihood estimation will also result in overfitting. To control the model
complexity, instead of the earlier regularisation weight penalty
$E_W(\mathbf{w})$, we now define a prior distribution which expresses our ‘degree
of belief’ over values that $\mathbf{w}$ might take:

$$ p(\mathbf{w}|\alpha) = \prod_{m=1}^{M} \left( \frac{\alpha}{2\pi} \right)^{1/2} \exp\left\{ -\frac{\alpha}{2} w_m^2 \right\}. \qquad (10) $$

This (common) choice of a zero-mean Gaussian prior expresses a preference
for smoother models by declaring smaller weights to be a priori more probable.
Though the prior is independent for each weight, there is a shared inverse
variance hyperparameter $\alpha$, analogous to $\lambda$ earlier, which moderates
the strength of our ‘belief’.



2.6 Posterior Inference 

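Previously, given our error measure and regulariser, we computed a single point
estimate $\mathbf{w}_{LS}$ for the weights. Now, given the likelihood and the
prior, we compute the posterior distribution over $\mathbf{w}$ via Bayes’ rule:

$$ p(\mathbf{w}|\mathbf{t}, \alpha, \sigma^2) = \frac{\text{likelihood} \times \text{prior}}{\text{normalising factor}} = \frac{p(\mathbf{t}|\mathbf{w}, \sigma^2)\, p(\mathbf{w}|\alpha)}{p(\mathbf{t}|\alpha, \sigma^2)}. \qquad (11) $$

As a consequence of combining a Gaussian prior and a linear model within a
Gaussian likelihood, the posterior is also conveniently Gaussian:
$p(\mathbf{w}|\mathbf{t}, \alpha, \sigma^2) = N(\boldsymbol{\mu}, \boldsymbol{\Sigma})$ with

$$ \boldsymbol{\mu} = (\boldsymbol{\Phi}^T \boldsymbol{\Phi} + \sigma^2 \alpha \mathbf{I})^{-1} \boldsymbol{\Phi}^T \mathbf{t}, \qquad (12) $$

$$ \boldsymbol{\Sigma} = \sigma^2 (\boldsymbol{\Phi}^T \boldsymbol{\Phi} + \sigma^2 \alpha \mathbf{I})^{-1}. \qquad (13) $$

So instead of ‘learning’ a single value for $\mathbf{w}$, we have inferred a
distribution over all possible values. In effect, we have updated our prior
‘belief’ in the parameter values in light of the information provided by the data
$\mathbf{t}$, with more posterior probability assigned to values which are both
probable under the prior and which ‘explain the data’.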
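A brief numerical sketch of the posterior computation may help here. The values of
the prior inverse variance and the noise variance, the input grid, the seed and
the basis width are assumptions chosen for illustration, and `weight_posterior` is
a hypothetical helper name. The code also draws a few weight vectors from the
posterior, in the spirit of the sampled functions shown in Figure 4.

```python
import numpy as np

rng = np.random.default_rng(0)
N = 15
x = np.linspace(0.0, 2.0 * np.pi, N)              # assumed input grid
t = np.sin(x) + rng.normal(0.0, np.sqrt(0.2), N)
Phi = np.exp(-((x[:, None] - x[None, :]) ** 2))   # Gaussian basis, assumed r = 1

alpha = 1.0     # assumed prior inverse variance
sigma2 = 0.2    # assumed known noise variance

def weight_posterior(Phi, t, alpha, sigma2):
    """Return (mu, Sigma) of the Gaussian posterior over the weights."""
    M = Phi.shape[1]
    A = Phi.T @ Phi + sigma2 * alpha * np.eye(M)
    mu = np.linalg.solve(A, Phi.T @ t)            # posterior mean
    Sigma = sigma2 * np.linalg.inv(A)             # posterior covariance
    return mu, Sigma

mu, Sigma = weight_posterior(Phi, t, alpha, sigma2)

# Draw a few weight vectors from the posterior and evaluate y(x; w) for each.
samples = rng.multivariate_normal(mu, Sigma, size=5)
sampled_functions = samples @ Phi.T               # shape (5, N)
```

Note that the posterior mean has the same algebraic form as the penalised
least-squares solution with the penalty weight set to the product of the noise
variance and the prior inverse variance.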

MAP Estimation: A ‘Bayesian’ Short-Cut. The “maximum a posteriori” (MAP)
estimate for $\mathbf{w}$ is the single most probable value under the posterior
distribution $p(\mathbf{w}|\mathbf{t}, \alpha, \sigma^2)$. Since the denominator in
Bayes’ rule (11) earlier is independent of $\mathbf{w}$, this is equivalent to
maximising the numerator, or equivalently minimising
$E_{MAP}(\mathbf{w}) = -\log p(\mathbf{t}|\mathbf{w}, \sigma^2) - \log p(\mathbf{w}|\alpha)$.
Retaining only those terms dependent on $\mathbf{w}$ gives:

$$ E_{MAP}(\mathbf{w}) = \frac{1}{2\sigma^2} \sum_{n=1}^{N} \{t_n - y(x_n; \mathbf{w})\}^2 + \frac{\alpha}{2} \sum_{m=1}^{M} w_m^2. \qquad (14) $$

The MAP estimate is therefore identical to the PLS estimate with
$\lambda = \sigma^2 \alpha$.

Illustration of Sequential Bayesian Inference. For our example problem,
we’ll end this section by looking at how the posterior
$p(\mathbf{w}|\mathbf{t}, \alpha, \sigma^2)$ evolves as we observe increasingly more
data points $t_n$. Before proceeding, we note that we can compute the posterior
incrementally since here the data are assumed independent (conditioned on
$\mathbf{w}$), e.g. for $\mathbf{t} = (t_1, t_2, t_3)$:

$$ p(\mathbf{w}|t_1, t_2, t_3) \propto p(t_1, t_2, t_3|\mathbf{w})\, p(\mathbf{w}), $$

$$ = p(t_2, t_3|\mathbf{w})\, p(t_1|\mathbf{w})\, p(\mathbf{w}), $$

$$ \propto \text{Likelihood of } (t_2, t_3) \times \text{posterior having observed } t_1. $$

So, more generally, we can treat the posterior having observed $(t_1, \ldots, t_K)$
as the ‘prior’ for the remaining data $(t_{K+1}, \ldots, t_N)$ and obtain the
equivalent result to seeing all the data at once. We exploit this result in
Figure 4 where we illustrate how the posterior distribution updates with
increasing amounts of data.
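The equivalence between sequential and one-shot updating can be verified
numerically. The following is a minimal sketch, assuming illustrative
hyperparameter values, an assumed input grid and basis width, and a hypothetical
helper `update` that absorbs a batch of observations into a Gaussian belief over
the weights; the split into 8 and then 7 points mirrors the rows of Figure 4.

```python
import numpy as np

rng = np.random.default_rng(0)
N = 15
x = np.linspace(0.0, 2.0 * np.pi, N)              # assumed input grid
t = np.sin(x) + rng.normal(0.0, np.sqrt(0.2), N)
Phi = np.exp(-((x[:, None] - x[None, :]) ** 2))   # Gaussian basis, assumed r = 1
alpha, sigma2 = 1.0, 0.2                          # assumed hyperparameter values

def update(mean, cov, Phi_new, t_new, sigma2):
    """Absorb new observations into a Gaussian belief N(mean, cov) over w."""
    prior_precision = np.linalg.inv(cov)
    post_precision = prior_precision + Phi_new.T @ Phi_new / sigma2
    post_cov = np.linalg.inv(post_precision)
    post_mean = post_cov @ (prior_precision @ mean + Phi_new.T @ t_new / sigma2)
    return post_mean, post_cov

M = Phi.shape[1]
prior_mean, prior_cov = np.zeros(M), np.eye(M) / alpha

# Two-stage: first 8 points, then the remaining 7 with the interim posterior
# playing the role of the prior.
m8, S8 = update(prior_mean, prior_cov, Phi[:8], t[:8], sigma2)
m_seq, S_seq = update(m8, S8, Phi[8:], t[8:], sigma2)

# One-shot: all 15 points at once.
m_all, S_all = update(prior_mean, prior_cov, Phi, t, sigma2)
```

Both routes give the same mean and covariance up to round-off, because the
precisions and the precision-weighted means simply add across independent
batches.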

The second row in Figure 4 illustrates some relevant points. First, because 
the data observed up to that point are not generally near the centres of the 
two basis functions visualised, those values of x are relatively uninformative 
regarding the associated weights and the posterior thereover has not deviated 
far from the prior. Second, on the far right in the second row, we can see that 
the function is fairly well determined in the vicinity of the observations, but at 
higher values of x, where data are yet to be observed, the MAP estimate of the 
function is not accurate and the posterior samples there exhibit high variance. 
On the third row we have observed all data, and notice that although the MAP 
predictor appears subjectively good, the posterior still seems quite diffuse and 
the variance in the samples is noticeable. We emphasise this point in the bottom 
row, where we have generated and observed an extra 200 data points and it can 
be seen how the posterior is now much more concentrated, and samples from it 
are now quite closely concentrated about the MAP value. 

Note that this facility to sample from the prior or posterior is a very infor- 
mative feature of the Bayesian paradigm. For the posterior, it is a helpful way 
of visualising the remaining uncertainty in parameter estimates in cases where 
the posterior distribution itself cannot be visualised. Furthermore, the ability 
to visualise samples from the prior alone is very advantageous, as it offers us 
evidence to judge the appropriateness of our prior assumptions. No equivalent 
facility exists within the regularisation or penalty function framework. 

3 Marginalisation and Ockham’s Razor 

Since we have just seen that the maximum a posteriori (MAP) and penalised
least-squares (PLS) estimates are equivalent, it might be tempting to assume
that the Bayesian framework is simply a probabilistic re-interpretation of
classical methods. This is certainly not the case! It is sometimes overlooked that
the distinguishing element of Bayesian methods is really marginalisation, where
instead of seeking to ‘estimate’ all ‘nuisance’ variables in our models, we attempt
to integrate them out. As we will now see, this is a powerful component of the
Bayesian framework.





Fig. 4. Illustration of the evolution of the posterior distribution as data is
sequentially ‘absorbed’. The left column shows the data, with those points which
have been observed so far crossed, along with a plot of the basis functions. The
contour plots in the middle column show the prior/posterior over just two (for
visualisation purposes) of the weights, $w_{10}$ and $w_{11}$, corresponding to the
highlighted basis functions on the left. The right hand column plots
$y(x; \mathbf{w})$ from a number of samples of $\mathbf{w}$ from the full
prior/posterior, along with the posterior mean, or MAP, estimator (in thicker
green/gray). From top to bottom, the number of data is increasing. Row 1 shows the
a priori case for no data, row 2 shows the model after 8 examples, and row 3 shows
the model after all 15 data points have been observed. Finally, the bottom row shows
the case when an additional 200 data points have been generated and absorbed in the
posterior model

3.1 Making Predictions

First, let’s reiterate some of the previous section and consider how, having
‘learned’ from the training values $\mathbf{t}$, we make a prediction for the
value of $t_*$ given a new input datum $x_*$:

Framework        Learned Quantity                                Prediction

Classical        $\mathbf{w}_{PLS}$                              $y(x_*; \mathbf{w}_{PLS})$

MAP Bayesian     $p(\mathbf{w}|\mathbf{t}, \alpha, \sigma^2)$    $p(t_*|\mathbf{w}_{MAP}, \sigma^2)$

True Bayesian    $p(\mathbf{w}|\mathbf{t}, \alpha, \sigma^2)$    $p(t_*|\mathbf{t}, \alpha, \sigma^2)$

The first two approaches result in similar predictions, although the MAP
Bayesian model does give a probability distribution for $t_*$ (which can be sampled
from, see Figure 4). The mean of this distribution is the same as that of the
classical predictor $y(x_*; \mathbf{w}_{PLS})$, since
$\mathbf{w}_{MAP} = \mathbf{w}_{PLS}$.

However, the ‘true Bayesian’ way is to integrate out, or marginalise over, the
uncertain variables $\mathbf{w}$ in order to obtain the predictive distribution:

$$ p(t_*|\mathbf{t}, \alpha, \sigma^2) = \int p(t_*|\mathbf{w}, \sigma^2)\, p(\mathbf{w}|\mathbf{t}, \alpha, \sigma^2)\, d\mathbf{w}. \qquad (15) $$

This distribution $p(t_*|\mathbf{t}, \alpha, \sigma^2)$ incorporates our uncertainty
over the weights having seen $\mathbf{t}$, by averaging the model probability for
$t_*$ over all possible values of $\mathbf{w}$. If we are unsure about the parameter
settings, for example if there were very few data points, then
$p(\mathbf{w}|\mathbf{t}, \alpha, \sigma^2)$ and similarly
$p(t_*|\mathbf{t}, \alpha, \sigma^2)$ will be appropriately diffuse. The classical,
and even MAP Bayesian, predictions take no account of how well-determined our
parameters $\mathbf{w}$ really are.



3.2 The General Bayesian Predictive Framework 

You may well find the presence of $\alpha$ and $\sigma^2$ as conditioning variables
in the predictive distribution, $p(t_*|\mathbf{t}, \alpha, \sigma^2)$, in (15) rather
disconcerting, and indeed, for any general model, if we wish to predict $t_*$ given
some training data $\mathbf{t}$, what we really, really want is
$p(t_*|\mathbf{t})$. That is, we wish to integrate out variables not directly
related to the task at hand. So far, we’ve only placed a prior over the weights
$\mathbf{w}$; to be truly, truly Bayesian, we should define $p(\alpha)$, a
so-called hyperprior, along with a prior over the noise level $p(\sigma^2)$. Then
the full posterior over ‘nuisance’ variables becomes:

$$ p(\mathbf{w}, \alpha, \sigma^2|\mathbf{t}) = \frac{p(\mathbf{t}|\mathbf{w}, \sigma^2)\, p(\mathbf{w}|\alpha)\, p(\alpha)\, p(\sigma^2)}{p(\mathbf{t})}. \qquad (16) $$

The denominator, or normalising factor, in (16) is the marginal probability of the
data:

$$ p(\mathbf{t}) = \int p(\mathbf{t}|\mathbf{w}, \sigma^2)\, p(\mathbf{w}|\alpha)\, p(\alpha)\, p(\sigma^2)\, d\mathbf{w}\, d\alpha\, d\sigma^2, \qquad (17) $$

and is nearly always analytically intractable to compute! Nevertheless, as we’ll
soon see, $p(\mathbf{t})$ is a very useful quantity and can be amenable to effective
approximation.



3.3 Practical Bayesian Prediction 

Given the full posterior (16), Bayesian inference in our example regression model
would proceed with:

$$ p(t_*|\mathbf{t}) = \int p(t_*|\mathbf{w}, \sigma^2)\, p(\mathbf{w}, \alpha, \sigma^2|\mathbf{t})\, d\mathbf{w}\, d\alpha\, d\sigma^2, \qquad (18) $$

but as we indicated, we can’t compute either
$p(\mathbf{w}, \alpha, \sigma^2|\mathbf{t})$ or $p(t_*|\mathbf{t})$ analytically.
If we wish to proceed, we must turn to some approximation strategy (and it is
here that much of the Bayesian “voodoo” resides). A sensible approach might
be to perform those integrations that are analytically computable, and then
approximate remaining integrations, perhaps using one of a number of established
methods:

- Type-II maximum likelihood (discussed shortly)

- Laplace’s method (see, e.g., [2])

- Variational techniques (see, e.g., [3, 4])

- Sampling (e.g. [2, 5])

Much research in Bayesian inference has gone, and continues to go, into the
development and assessment of approximation techniques, including those listed
above. For the purposes of this article, we will primarily exploit the first of them.



3.4 A Type-II Maximum Likelihood Approximation 

Here, using the product rule of probability, we can rewrite the ideal full posterior
$p(\mathbf{w}, \alpha, \sigma^2|\mathbf{t})$ as:

$$ p(\mathbf{w}, \alpha, \sigma^2|\mathbf{t}) \equiv p(\mathbf{w}|\mathbf{t}, \alpha, \sigma^2)\, p(\alpha, \sigma^2|\mathbf{t}). \qquad (19) $$

The first term is our earlier weight posterior which we have already computed:
$p(\mathbf{w}|\mathbf{t}, \alpha, \sigma^2) \sim N(\boldsymbol{\mu}, \boldsymbol{\Sigma})$.
The second term $p(\alpha, \sigma^2|\mathbf{t})$ we will approximate, admittedly
crudely, by a $\delta$-function at its mode, i.e. we find “most probable” values
$\alpha_{MP}$ and $\sigma^2_{MP}$ which maximise:

$$ p(\alpha, \sigma^2|\mathbf{t}) = \frac{p(\mathbf{t}|\alpha, \sigma^2)\, p(\alpha)\, p(\sigma^2)}{p(\mathbf{t})}. \qquad (20) $$

Since the denominator is independent of $\alpha$ and $\sigma^2$, we only need
maximise the numerator $p(\mathbf{t}|\alpha, \sigma^2)\, p(\alpha)\, p(\sigma^2)$.
Furthermore, if we assume flat, uninformative, priors over $\log \alpha$ and
$\log \sigma$, then we equivalently just need to find the maximum of
$p(\mathbf{t}|\alpha, \sigma^2)$. Assuming a flat prior here may seem to be a
computational convenience, but in fact it is arguably our prior of choice since our
model will be invariant to the scale of the target data (and basis set), which is
almost always an advantageous feature (see the footnote on scale parameters). For
example, our results won’t change if we measure $\mathbf{t}$ in metres instead of
miles. We’ll return to the task of maximising $p(\mathbf{t}|\alpha, \sigma^2)$ in
Section 3.6.
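The scale-invariance argument can be made concrete with a short change of
variables. This derivation is a sketch added for illustration; the symbols $u$,
$c$ and $\alpha'$ are introduced here and do not appear elsewhere in the text.

```latex
% A flat prior on u = \log\alpha induces a 1/\alpha density on \alpha:
p(u) \propto 1
\;\Longrightarrow\;
p(\alpha) = p(u)\left|\frac{du}{d\alpha}\right| \propto \frac{1}{\alpha}.

% Rescaling \alpha' = c\,\alpha for any constant c > 0 leaves the form unchanged:
p(\alpha') = p(\alpha)\left|\frac{d\alpha}{d\alpha'}\right|
           \propto \frac{1}{\alpha}\cdot\frac{1}{c}
           = \frac{1}{\alpha'} .
```

Changing the units of the targets by a factor $k$ rescales the weights by $k$, and
hence the inverse variance hyperparameter by $1/k^2$ and the noise variance by
$k^2$. Since the log-uniform prior has the same form before and after such a
rescaling, it expresses no preference that depends on the choice of units.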

3.5 The Approximate Predictive Distribution 

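Having found $\alpha_{MP}$ and $\sigma^2_{MP}$, our approximation to the predictive
distribution would be:

$$ p(t_*|\mathbf{t}) = \int p(t_*|\mathbf{w}, \sigma^2)\, p(\mathbf{w}|\mathbf{t}, \alpha, \sigma^2)\, p(\alpha, \sigma^2|\mathbf{t})\, d\mathbf{w}\, d\alpha\, d\sigma^2, $$

$$ \approx \int p(t_*|\mathbf{w}, \sigma^2)\, p(\mathbf{w}|\mathbf{t}, \alpha, \sigma^2)\, \delta(\alpha_{MP}, \sigma^2_{MP})\, d\mathbf{w}\, d\alpha\, d\sigma^2, $$

$$ = \int p(t_*|\mathbf{w}, \sigma^2_{MP})\, p(\mathbf{w}|\mathbf{t}, \alpha_{MP}, \sigma^2_{MP})\, d\mathbf{w}. \qquad (21) $$

In our example earlier, recall that
$p(\mathbf{w}|\mathbf{t}, \alpha_{MP}, \sigma^2_{MP}) \sim N(\boldsymbol{\mu}, \boldsymbol{\Sigma})$,
from which the approximate predictive distribution can be finally written as:

$$ p(t_*|\mathbf{t}) \approx \int p(t_*|\mathbf{w}, \sigma^2_{MP})\, p(\mathbf{w}|\mathbf{t}, \alpha_{MP}, \sigma^2_{MP})\, d\mathbf{w}. \qquad (22) $$

This is now computable and is Gaussian: $N(\mu_*, \sigma_*^2)$, with:

$$ \mu_* = y(x_*; \boldsymbol{\mu}), $$

$$ \sigma_*^2 = \sigma^2_{MP} + \mathbf{f}^T \boldsymbol{\Sigma} \mathbf{f}, $$

where $\mathbf{f} = [\phi_1(x_*), \ldots, \phi_M(x_*)]^T$. Intuitively, we see that

- the mean predictor $\mu_*$ is the model function evaluated with the posterior
mean weights (the same as the MAP prediction),

- the predictive variance $\sigma_*^2$ is the sum of variances associated with both
the noise process and the uncertainty of the weight estimates. In particular, it
can be clearly seen that when the posterior over $\mathbf{w}$ is more diffuse, and
$\boldsymbol{\Sigma}$ is larger, $\sigma_*^2$ is also increased.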
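The predictive mean and variance can be sketched in code as follows. This is an
illustration under stated assumptions: the hyperparameter values below are
placeholders standing in for the most probable values (they have not been
optimised), the input grid, seed and basis width are assumed, and the helper names
`basis` and `predict` are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)
N = 15
x = np.linspace(0.0, 2.0 * np.pi, N)              # assumed input grid
t = np.sin(x) + rng.normal(0.0, np.sqrt(0.2), N)
alpha_mp, sigma2_mp = 1.0, 0.2                    # placeholders, not optimised

def basis(x_eval):
    """Rows are [phi_1(x_*), ..., phi_M(x_*)], Gaussian basis with assumed r = 1."""
    return np.exp(-((np.atleast_1d(x_eval)[:, None] - x[None, :]) ** 2))

Phi = basis(x)
A = Phi.T @ Phi + sigma2_mp * alpha_mp * np.eye(N)
mu = np.linalg.solve(A, Phi.T @ t)                # posterior mean of the weights
Sigma = sigma2_mp * np.linalg.inv(A)              # posterior covariance

def predict(x_star):
    """Mean and variance of the Gaussian predictive distribution for t_*."""
    F = basis(x_star)
    mean = F @ mu                                 # model evaluated at mean weights
    # Noise variance plus f^T Sigma f for each query point.
    var = sigma2_mp + np.einsum("ij,jk,ik->i", F, Sigma, F)
    return mean, var

x_star = np.linspace(0.0, 2.0 * np.pi, 50)
mean_star, var_star = predict(x_star)
```

The predictive variance never falls below the noise variance. One caveat of
localised basis functions, visible if `predict` is called far from all data, is
that the weight-uncertainty term vanishes there because every basis function is
close to zero.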

3.6 Marginal Likelihood 

Returning now to the question of finding $\alpha_{MP}$ and $\sigma^2_{MP}$, as noted
earlier we find the maximising values of the ‘marginal likelihood’
$p(\mathbf{t}|\alpha, \sigma^2)$. This is given by:

$$ p(\mathbf{t}|\alpha, \sigma^2) = \int p(\mathbf{t}|\mathbf{w}, \sigma^2)\, p(\mathbf{w}|\alpha)\, d\mathbf{w}, $$

$$ = (2\pi)^{-N/2}\, |\sigma^2 \mathbf{I} + \alpha^{-1} \boldsymbol{\Phi} \boldsymbol{\Phi}^T|^{-1/2} \exp\left\{ -\frac{1}{2} \mathbf{t}^T (\sigma^2 \mathbf{I} + \alpha^{-1} \boldsymbol{\Phi} \boldsymbol{\Phi}^T)^{-1} \mathbf{t} \right\}. \qquad (23) $$

This is a Gaussian distribution over the single $N$-dimensional dataset vector
$\mathbf{t}$, and (23) is readily evaluated for arbitrary values of $\alpha$ (and
$\sigma^2$). Note that here we can use all the data to directly determine
$\alpha_{MP}$ and $\sigma^2_{MP}$; we don’t need to reserve a separate data set to
validate their values. We can use gradient-based techniques to maximise (23) (and
we will do so for a similar quantity in Section 4), but here we choose to repeat
the earlier experiment for the regularised linear model. While we fix $\sigma^2$
(though we could also experimentally evaluate it), in Figure 5 we have computed the
marginal likelihood (in fact, its negative logarithm) at a number of different
values of $\alpha$ (for just the 15-example training set, though we could also have
made use of the validation set too) and compared with the training, validation and
test errors of Figure 3 earlier.

Footnote: Note that for scale parameters such as $\alpha$ and $\sigma^2$, it can be
shown that it is appropriate to define uninformative priors uniformly over a
logarithmic scale [6]. While for brevity we will continue to denote parameters
“$\alpha$” and “$\sigma$”, from now on we will work with the logarithms thereof,
and in particular, will maximise distributions with respect to $\log \alpha$ and
$\log \sigma$. In this respect, one must note with caution that finding the maximum
of a distribution with respect to parameters is not invariant to transformations
of those parameters, whereas the result of integration with respect to transformed
distributions is invariant.

It is quite striking that, using only 15 examples and no validation data, the
Bayesian approach for setting $\alpha$ (giving test error 1.66) finds a closer model
to the ‘truth’ than the classical model with its validated value of $\lambda$ (test
error 2.33). It is also interesting to see, and it is not immediately obvious why,
that the marginal likelihood measure, although only measured on the training data,
is not monotonic (unlike training error) and exhibits a maximum at some
intermediate complexity level. The marginal likelihood criterion appears to be
successfully penalising both models that are too simple and too complex: this is
“Ockham’s Razor” at work.
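A small sketch of this experiment is given below. It evaluates the negative log
marginal likelihood on a grid of values for the prior inverse variance, using the
training data alone. The grid, the input locations, the seed, the basis width and
the fixed noise variance are assumptions made for illustration, so the numbers it
prints will not match the test errors quoted above; `neg_log_marginal_likelihood`
is a hypothetical helper name.

```python
import numpy as np

rng = np.random.default_rng(0)
N = 15
x = np.linspace(0.0, 2.0 * np.pi, N)              # assumed input grid
t = np.sin(x) + rng.normal(0.0, np.sqrt(0.2), N)
Phi = np.exp(-((x[:, None] - x[None, :]) ** 2))   # Gaussian basis, assumed r = 1
sigma2 = 0.2                                      # noise variance held fixed

def neg_log_marginal_likelihood(alpha, sigma2, Phi, t):
    """-log p(t | alpha, sigma^2) for the Gaussian marginal over t."""
    n = len(t)
    C = sigma2 * np.eye(n) + (Phi @ Phi.T) / alpha   # covariance of t
    _, logdet = np.linalg.slogdet(C)
    quad = t @ np.linalg.solve(C, t)
    return 0.5 * (n * np.log(2.0 * np.pi) + logdet + quad)

alphas = np.logspace(-4, 4, 81)
nll = np.array([neg_log_marginal_likelihood(a, sigma2, Phi, t) for a in alphas])
alpha_mp = alphas[np.argmin(nll)]
print(f"alpha maximising the marginal likelihood on this grid: {alpha_mp:.3g}")
```

On data of this kind one would expect the minimum to fall at an intermediate grid
value: the log-determinant term grows as the prior is loosened, while the
quadratic data-fit term grows as it is tightened.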




Fig. 5. Plots of the training, validation and test errors of the model as shown in
Figure 3 (with the horizontal scale adjusted appropriately to convert from
$\lambda$ to $\alpha$) along with the negative log marginal likelihood evaluated on
the training data alone for that same model. The values of $\alpha$ and test error
achieved by the model with highest marginal likelihood (smallest negative log) are
indicated

3.7 Ockham’s Razor

In the fourteenth century, William of Ockham proposed a principle which literally
translates as “entities should not be multiplied unnecessarily”.
Its original historic context was theological, but the concept remains relevant
for machine learning today, where it might be translated as “models should be
no more complex than is sufficient to explain the data”. The Bayesian procedure
is effectively implementing “Ockham’s Razor” by assigning lower probability both
to models that are too simple and to models that are too complex. We might ask: why
is an intermediate value of $\alpha$ preferred? The schematic of Figure 6 shows how
this can be the case, as a result of the marginal likelihood
$p(\mathbf{t}|\alpha)$ being a normalised distribution over the space of all
possible data sets $\mathbf{t}$. Models with high $\alpha$ only fit (assign
significant marginal probability to) data from smooth functions. Models with low
values of $\alpha$ can fit data generated from functions that are both smooth and
complex. However, because of normalisation, the low-$\alpha$ model must generally
assign lower probability to data from smooth functions, so the marginal likelihood
naturally prefers the simpler model if the data is smooth, which is precisely the
meaning of Ockham’s Razor. Furthermore, one can see from Figure 6 that for a data
set of ‘intermediate’ complexity, a ‘medium’ value of $\alpha$ can be preferred.
This is qualitatively analogous to the case of our example set, where we indeed
find that an intermediate value of $\alpha$ is optimal. Note, crucially, that this
is achieved without any prior preference for any particular value of $\alpha$ as we
originally assumed a uniform hyperprior over its logarithm. The effect of Ockham’s
Razor is an automatic and pleasing consequence of applying the Bayesian framework.

[Figure 6: vertical axis ‘Marginal Model Probability $p(\mathbf{t}|\alpha)$’;
horizontal axis ‘Space of all data sets $\mathbf{t}$ (increasing complexity)’;
curve labels include ‘Medium $\alpha$’ and ‘Low $\alpha$’; marker ‘Noisy Sine Data’]

Fig. 6. A schematic plot of three marginal probability distributions for ‘high’,
‘medium’ and ‘low’ values of $\alpha$. The figure is a simplification of the case
for the actual distribution $p(\mathbf{t}|\alpha)$, where for illustrative purposes
the $N$-dimensional space of $\mathbf{t}$ has been compressed onto a single axis and
where, notionally, data sets (instances of $\mathbf{t}$) arising from simpler
(smoother) functions lie towards the left-hand end of the horizontal scale,
and data from complex functions to the right



3.8 Model Selection 



While we have concentrated so far on the search for an appropriate value of
hyperparameter $\alpha$ (and, to an extent, $\sigma^2$), our model is also
conditioned on other variables we have up to now overlooked: the choice of basis
set $\boldsymbol{\Phi}$ and, for our Gaussian basis, its width parameter $r$ (as
defined in Section 2.1). Ideally, we should define priors $p(\boldsymbol{\Phi})$
and $p(r)$, and integrate out those variables when making predictions. More
practically, we could use $p(\mathbf{t}|\boldsymbol{\Phi}, r)$ as a criterion for
model selection with the expectation that Ockham’s Razor will assist us in
selecting a model that is sufficient to explain the data but is not over-complex.
In our example model, we previously optimised the marginal likelihood to find a
value for $\alpha$. In fact, as there are only two nuisance parameters here, it is
feasible to integrate out $\alpha$ and $\sigma^2$ numerically.

In Figure 7 we evaluate several basis sets $\boldsymbol{\Phi}$ and width values $r$
by computing the integral

$$ p(\mathbf{t}|\boldsymbol{\Phi}, r) = \int p(\mathbf{t}|\alpha, \sigma^2, \boldsymbol{\Phi}, r)\, p(\alpha)\, p(\sigma^2)\, d\alpha\, d\sigma^2, \qquad (24) $$

$$ \approx \frac{1}{S} \sum_{s=1}^{S} p(\mathbf{t}|\alpha_s, \sigma_s^2, \boldsymbol{\Phi}, r), \qquad (25) $$

with a Monte-Carlo average where we obtain $S$ samples log-uniformly from
$\alpha \in [10^{-12}, 10^{12}]$ and $\sigma \in [10^{-4}, 10^{0}]$.

The results of Figure 7 are quite compelling: with uniform priors over all
nuisance variables, i.e. having imposed no prior knowledge, we observe that test
error appears very closely related to marginal likelihood. The qualitative shapes
of the curves, and the relative merits, of Gaussian and Laplacian basis functions
are also captured. For the Gaussian basis we are very close to obtaining the
optimal value of $r$, in terms of test error, from just 15 examples and no
validation data. Reassuringly, the simplest model that contains the ‘truth’,
$y = w_1 \sin(x)$, is the most probable model here. We also show in the figure the
model $y = w_1 \sin(x) + w_2 \cos(x)$ which is also an ideal fit for the data, but
it is penalised in marginal probability terms since the addition of the
$w_2 \cos(x)$ term allows it to explain more data sets, and normalisation thus
requires it to assign less probability to our particular set. Nevertheless, it is
still some orders of magnitude more probable than the Gaussian basis model.
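The Monte-Carlo average over the two nuisance hyperparameters can be sketched as
below. This is an illustrative implementation, not the code behind Figure 7: the
input grid, seed, number of samples, candidate widths and the lower limit assumed
for the noise standard deviation range are choices made here, and `log_marginal`,
`log_evidence` and `gaussian_basis` are hypothetical helper names. The marginal
likelihood is evaluated through an eigendecomposition of the Gram matrix and the
average is taken with a log-mean-exp, both purely for numerical stability across
the very wide sampling ranges.

```python
import numpy as np

rng = np.random.default_rng(0)
N = 15
x = np.linspace(0.0, 2.0 * np.pi, N)              # assumed input grid
t = np.sin(x) + rng.normal(0.0, np.sqrt(0.2), N)

def log_marginal(alpha, sigma2, evals, proj):
    """log p(t | alpha, sigma^2) via the eigendecomposition of Phi Phi^T."""
    c = sigma2 + evals / alpha                    # eigenvalues of the covariance of t
    return -0.5 * (len(proj) * np.log(2.0 * np.pi)
                   + np.sum(np.log(c)) + np.sum(proj ** 2 / c))

def log_evidence(Phi, t, n_samples=2000, seed=0):
    """Monte-Carlo estimate of log p(t | basis), averaging over alpha and sigma."""
    sampler = np.random.default_rng(seed)
    evals, U = np.linalg.eigh(Phi @ Phi.T)
    evals = np.clip(evals, 0.0, None)             # remove round-off negatives
    proj = U.T @ t
    # Log-uniform samples; the ranges are assumptions taken for illustration.
    alphas = 10.0 ** sampler.uniform(-12.0, 12.0, n_samples)
    sigmas = 10.0 ** sampler.uniform(-4.0, 0.0, n_samples)
    logp = np.array([log_marginal(a, s ** 2, evals, proj)
                     for a, s in zip(alphas, sigmas)])
    peak = logp.max()                             # log-mean-exp for stability
    return peak + np.log(np.mean(np.exp(logp - peak)))

def gaussian_basis(r):
    """Data-centred Gaussian design matrix of width r."""
    return np.exp(-((x[:, None] - x[None, :]) ** 2) / r ** 2)

for r in (0.25, 1.0, 4.0):                        # candidate widths (assumed)
    print(f"r = {r}: log p(t | Phi, r) ~ {log_evidence(gaussian_basis(r), t):.2f}")
```

Comparing the printed values across widths gives a model-selection criterion that
uses the training data only. With so broad a sampling range the estimate is
dominated by relatively few samples, so in practice a larger sample count or a
repeated run with different seeds would be a sensible check.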



Fig. 7. Top: negative log model probability $-\log p(\mathbf{t}|\boldsymbol{\Phi}, r)$
for various basis sets, evaluated by analytic integration over $\mathbf{w}$ and
Monte-Carlo averaging over $\alpha$ and $\sigma^2$. Bottom: corresponding test
error for the posterior mean predictor. Basis sets examined were ‘Gaussian’,
$\exp\{-|x - x_m|^2 / r^2\}$, ‘Laplacian’, $\exp\{-|x - x_m| / r\}$, $\sin(x)$, and
$\sin(x)$ with $\cos(x)$. For the Gaussian and Laplacian basis, the horizontal axis
denotes varying ‘width’ parameter $r$ shown. For the sine/cosine bases, the
horizontal axis has no significance and the values are placed to the left for
convenience

3.9 Summary So Far...

Marginalisation is the key element of Bayesian inference, and hopefully some of
the examples above have persuaded the reader that it can be an exceedingly
powerful one. Problematically though, ideal Bayesian inference proceeds by
integrating out all irrelevant variables, and we must concede that:

- for practical purposes, it may be appropriate to require point estimates of
some ‘nuisance’ variables, since it could easily be impractical to average
over many parameters and particularly models every time we wish to make
a prediction (imagine, for example, running a handwriting recogniser on a
portable computing device),

- many of the desired integrations necessitate some form of approximation.

Nevertheless, regarding these points, we can still leverage Bayesian techniques
to considerable benefit exploiting carefully-applied approximations. In
particular, marginalised likelihoods within the Bayesian framework allow us to
estimate fixed values of hyperparameters where desired and, most beneficially,
choose between models and their varying parameterisations. This can all be done
without the need to use validation data. Furthermore:

- it is straightforward to estimate other parameters in the model that may be
of interest, e.g. the noise variance,

- we can sample from both prior and posterior models of the data,

- the exact parameterisation of the model is irrelevant when integrating out,

- we can incorporate other priors of interest in a principled manner.

We now further demonstrate these points, notably the last one, in the next
section where we present a practical framework for the inference of ‘sparse’
models.

4 Sparse Bayesian Models 

4.1 Bayes and Contemporary Machine Learning 

In the previous section we saw that marginalisation is a valuable component of 
the Bayesian paradigm which offers a number of advantageous features applicable 
to many data modelling tasks. Disadvantageously, we also saw that the integra- 
tions required for full Bayesian inference can often be analytically intractable, 
although approximations for simple linear models could be very effective. Histor- 
ically, interest in Bayesian “machine learning” (but not statistics!) has focused 
on approximations for models, for neural networks, the “evidence 

procedure” [7] and “hybrid Monte Carlo” sampling [5]. More recently, flexible 
( many-parameter) linear kernel methods have attracted much renewed in- 
terest, thanks mainly to the popularity of the “support vector machine”. These 
kind of models, of course, are particularly amenable to Bayesian techniques. 

Linear Models and Sparsity. Much interest in linear models has focused on 
learning algorithms, which set many weights Wm to zero in the estimated 
predictor function y{x) = Sparsity is an attractive concept; it 

offers elegant complexity control, feature extraction, the potential for elucidation 
of meaningful input variables along with the practical benefits of computational 
speed and compactness. 

How do we impose a preference for sparsity in a model? The most common
approach is via an appropriate regularisation term or prior. The most common
regularisation term that we have already met, $E_W(\mathbf{w}) = \sum_{m=1}^{M} |w_m|^2$, of course
corresponds to a Gaussian prior and is easy to work with, but while it is an
effective way to control complexity, it does not promote sparsity. In the
regularisation sense, the ‘correct’ term would be $E_W(\mathbf{w}) = \sum_m |w_m|^0$, but this,
being discontinuous in $w_m$, is very difficult to work with. Instead,
$E_W(\mathbf{w}) = \sum_m |w_m|^1$ is a workable compromise which gives reasonable sparsity
and reasonable tractability, and is exploited in a number of methods, including
as a Laplacian prior $p(\mathbf{w}) \propto \exp(-\sum_m |w_m|)$ [8]. However, there is an arguably
more elegant way of obtaining sparsity within a Bayesian framework that builds
effectively on the ideas outlined in the previous section and we conclude this
article with a brief outline thereof.
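
To make the difference between the quadratic and absolute-value penalties
concrete, here is a minimal sketch for a single weight. It assumes the simplest
possible setting, which is not taken from the text: a one-dimensional objective
$\tfrac12 (w - z)^2 + \lambda\,\mathrm{penalty}(w)$, where $z$ stands for an unregularised estimate and
$\lambda$ for the penalty strength. The function names and numbers are hypothetical.

```python
def shrink_l2(z, lam):
    """Minimiser of 0.5*(w - z)**2 + lam*w**2: proportional shrinkage."""
    return z / (1.0 + 2.0 * lam)


def shrink_l1(z, lam):
    """Minimiser of 0.5*(w - z)**2 + lam*abs(w): soft thresholding."""
    if z > lam:
        return z - lam
    if z < -lam:
        return z + lam
    return 0.0  # estimates inside [-lam, lam] are set exactly to zero


z_values = [3.0, 0.4, -0.2, -1.5]  # hypothetical unregularised estimates
lam = 0.5
w_l2 = [shrink_l2(z, lam) for z in z_values]
w_l1 = [shrink_l1(z, lam) for z in z_values]
print("quadratic penalty:     ", w_l2)
print("absolute-value penalty:", w_l1)
```

Under the quadratic penalty every weight is scaled towards zero but none reaches
it, whereas the absolute-value penalty sets the small estimates exactly to zero,
which is the sense in which it promotes sparsity.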

4.2 A Sparse Bayesian Prior 

In fact, we obtain sparsity by retaining the traditional Gaussian prior, which 
is great news for tractability. The modification to our earlier Gaussian prior (10) 
is subtle: 

$$p(\mathbf{w} \mid \alpha_1, \ldots, \alpha_M) = \prod_{m=1}^{M} \left[ (2\pi)^{-1/2}\, \alpha_m^{1/2} \exp\left( -\tfrac{1}{2} \alpha_m w_m^2 \right) \right]. \tag{26}$$

In contrast to the model in Section 2, we now have M hyperparameters
$\boldsymbol{\alpha} = (\alpha_1, \ldots, \alpha_M)$, one $\alpha_m$ independently controlling the (inverse) variance of
each weight $w_m$.



A Hierarchical Prior. The prior $p(\mathbf{w}|\boldsymbol{\alpha})$ is nevertheless still Gaussian, and
superficially seems to have little preference for sparsity. However, it remains
conditioned on $\boldsymbol{\alpha}$, so for full Bayesian consistency we should now define
hyperpriors over all $\alpha_m$. Previously, we utilised a log-uniform hyperprior; this
is a special case of a Gamma hyperprior, which we introduce for greater
generality here. This combination of the prior over $\alpha_m$ controlling the prior
over $w_m$ gives us what is often referred to as a hierarchical prior. Now, if we
have $p(w_m|\alpha_m)$ and $p(\alpha_m)$ and we want to know the ‘true’ $p(w_m)$ we already know
what to do: we must marginalise:

$$p(w_m) = \int p(w_m \mid \alpha_m)\, p(\alpha_m)\, d\alpha_m. \tag{27}$$

For a Gamma $p(\alpha_m)$, this integral is computable and we find that $p(w_m)$ is
a Student-t distribution, illustrated as a function of two parameters in Figure 8;
its equivalent as a regularising penalty function would be $\sum_m \log |w_m|$.
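
The marginalisation in (27) can be written out explicitly. The following is a
sketch under one assumption that the text above does not fix: the Gamma
hyperprior is parameterised as $\mathrm{Gamma}(\alpha_m \mid a, b) = \Gamma(a)^{-1} b^a \alpha_m^{a-1} e^{-b\alpha_m}$, with $a$ and $b$
introduced here purely for illustration.

```latex
\begin{align*}
p(w_m) &= \int_0^\infty \mathcal{N}(w_m \mid 0, \alpha_m^{-1})\,
          \mathrm{Gamma}(\alpha_m \mid a, b)\, d\alpha_m \\
       &= \frac{b^a}{(2\pi)^{1/2}\,\Gamma(a)}
          \int_0^\infty \alpha_m^{(a + 1/2) - 1}
          \exp\left\{ -\alpha_m \left( b + \tfrac{1}{2} w_m^2 \right) \right\} d\alpha_m \\
       &= \frac{b^a\, \Gamma(a + \tfrac{1}{2})}{(2\pi)^{1/2}\, \Gamma(a)}
          \left( b + \tfrac{1}{2} w_m^2 \right)^{-(a + 1/2)} .
\end{align*}
```

The last line is a Student-t density in $w_m$. In the log-uniform limit $a, b \to 0$ it
behaves as $p(w_m) \propto 1/|w_m|$, so that $-\log p(\mathbf{w})$ equals $\sum_m \log |w_m|$ up to a constant.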



4.3 A Sparse Bayesian Model for Regression 

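We can develop a sparse regression model by following an identical methodology
to the previous sections. Again, we assume independent Gaussian noise:
$t_n \sim \mathcal{N}(y(\mathbf{x}_n; \mathbf{w}), \sigma^2)$, which gives a corresponding likelihood:

$$p(\mathbf{t} \mid \mathbf{w}, \sigma^2) = (2\pi\sigma^2)^{-N/2} \exp\left\{ -\frac{1}{2\sigma^2} \|\mathbf{t} - \boldsymbol{\Phi}\mathbf{w}\|^2 \right\}, \tag{28}$$

where as before we denote $\mathbf{t} = (t_1 \ldots t_N)^{\mathrm{T}}$, $\mathbf{w} = (w_1 \ldots w_M)^{\mathrm{T}}$, and $\boldsymbol{\Phi}$ is the
N × M ‘design’ matrix with $\Phi_{nm} = \phi_m(\mathbf{x}_n)$.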



[Figure 8 panels: Gaussian prior; Marginal prior: single α; Independent α]

Fig. 8. Contour plots of Gaussian and Student-t prior distributions over two
parameters. While the marginal prior $p(w_1, w_2)$ for the ‘single’ hyperparameter
model of Section 2 has a much sharper peak than the Gaussian at zero, it can be
seen that it is not sparse, unlike the multiple ‘independent’ hyperparameter
prior, which as well as having a sharp peak at zero, places most of its
probability mass along axial ridges where the magnitude of one of the two
parameters is small

Following the Bayesian framework, we desire the posterior distribution over
all unknowns:

$$p(\mathbf{w}, \boldsymbol{\alpha}, \sigma^2 \mid \mathbf{t}) = \frac{p(\mathbf{t} \mid \mathbf{w}, \boldsymbol{\alpha}, \sigma^2)\, p(\mathbf{w}, \boldsymbol{\alpha}, \sigma^2)}{p(\mathbf{t})}, \tag{29}$$

which we can’t compute analytically. So as previously, we decompose this as:

$$p(\mathbf{w}, \boldsymbol{\alpha}, \sigma^2 \mid \mathbf{t}) = p(\mathbf{w} \mid \mathbf{t}, \boldsymbol{\alpha}, \sigma^2)\, p(\boldsymbol{\alpha}, \sigma^2 \mid \mathbf{t}), \tag{30}$$

where $p(\mathbf{w} \mid \mathbf{t}, \boldsymbol{\alpha}, \sigma^2)$ is the ‘weight posterior’ distribution, and is tractable. This
leaves $p(\boldsymbol{\alpha}, \sigma^2 \mid \mathbf{t})$ which must be approximated.



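The Weight Posterior Term. Given the data, the posterior distribution over
weights is Gaussian:

$$p(\mathbf{w} \mid \mathbf{t}, \boldsymbol{\alpha}, \sigma^2) = \frac{p(\mathbf{t} \mid \mathbf{w}, \sigma^2)\, p(\mathbf{w} \mid \boldsymbol{\alpha})}{p(\mathbf{t} \mid \boldsymbol{\alpha}, \sigma^2)} = (2\pi)^{-M/2} |\boldsymbol{\Sigma}|^{-1/2} \exp\left\{ -\tfrac{1}{2} (\mathbf{w} - \boldsymbol{\mu})^{\mathrm{T}} \boldsymbol{\Sigma}^{-1} (\mathbf{w} - \boldsymbol{\mu}) \right\}, \tag{31}$$

with

$$\boldsymbol{\Sigma} = (\sigma^{-2} \boldsymbol{\Phi}^{\mathrm{T}} \boldsymbol{\Phi} + \mathbf{A})^{-1}, \tag{32}$$

$$\boldsymbol{\mu} = \sigma^{-2} \boldsymbol{\Sigma} \boldsymbol{\Phi}^{\mathrm{T}} \mathbf{t}, \tag{33}$$

and where we collect all the hyperparameters into a diagonal matrix:
$\mathbf{A} = \mathrm{diag}(\alpha_1, \alpha_2, \ldots, \alpha_M)$. A key point to note from (31-33) is that if any $\alpha_m = \infty$,
the corresponding $\mu_m = 0$.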

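The Hyperparameter Posterior Term. Again we will adopt the “type-II
maximum likelihood” approximation where we maximise $p(\mathbf{t} \mid \boldsymbol{\alpha}, \sigma^2)$ to find $\boldsymbol{\alpha}_{MP}$
and $\sigma^2_{MP}$. As before, for uniform hyperpriors over $\log \alpha$ and $\log \sigma$,
$p(\boldsymbol{\alpha}, \sigma^2 \mid \mathbf{t}) \propto p(\mathbf{t} \mid \boldsymbol{\alpha}, \sigma^2)$, where the marginal likelihood $p(\mathbf{t} \mid \boldsymbol{\alpha}, \sigma^2)$ is obtained by
integrating out the weights:

$$p(\mathbf{t} \mid \boldsymbol{\alpha}, \sigma^2) = \int p(\mathbf{t} \mid \mathbf{w}, \sigma^2)\, p(\mathbf{w} \mid \boldsymbol{\alpha})\, d\mathbf{w} = (2\pi)^{-N/2} |\sigma^2 \mathbf{I} + \boldsymbol{\Phi} \mathbf{A}^{-1} \boldsymbol{\Phi}^{\mathrm{T}}|^{-1/2} \exp\left\{ -\tfrac{1}{2} \mathbf{t}^{\mathrm{T}} (\sigma^2 \mathbf{I} + \boldsymbol{\Phi} \mathbf{A}^{-1} \boldsymbol{\Phi}^{\mathrm{T}})^{-1} \mathbf{t} \right\}. \tag{34}$$

In Section 2, we found the single $\alpha_{MP}$ empirically but here for multiple (in
practice, perhaps thousands of) hyperparameters, we cannot experimentally
explore the space of possible $\boldsymbol{\alpha}$ so we instead optimise $p(\mathbf{t} \mid \boldsymbol{\alpha}, \sigma^2)$ directly, via a
gradient-based approach.

Hyperparameter Re-estimation. Differentiating $\log p(\mathbf{t} \mid \boldsymbol{\alpha}, \sigma^2)$ with respect
to $\boldsymbol{\alpha}$ and $\sigma^2$, setting to zero and rearranging (see [9]) ultimately gives iterative
re-estimation formulae:

$$\alpha_i^{\mathrm{new}} = \frac{\gamma_i}{\mu_i^2}, \tag{35}$$

$$(\sigma^2)^{\mathrm{new}} = \frac{\|\mathbf{t} - \boldsymbol{\Phi}\boldsymbol{\mu}\|^2}{N - \sum_{i=1}^{M} \gamma_i}. \tag{36}$$

For convenience we have defined

$$\gamma_i = 1 - \alpha_i \Sigma_{ii}, \tag{37}$$

where $\gamma_i \in [0, 1]$ is a measure of ‘well-determinedness’ of parameter $w_i$. This
quantity effectively captures the influence of the likelihood (total when $\gamma_i \to 1$)
and the prior (total when $\gamma_i \to 0$) on the value of each $w_i$. Note that the quantities
on the right-hand-side of equations (35-37) are computed using the ‘old’ values
of $\boldsymbol{\alpha}$ and $\sigma^2$.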

Summary of Inference Procedure. We’re now in a position to define a
‘learning algorithm’ for approximate Bayesian inference in this model:

1. Initialise all $\{\alpha_i\}$ and $\sigma^2$ (or fix latter if known)

2. Compute weight posterior sufficient statistics $\boldsymbol{\mu}$ and $\boldsymbol{\Sigma}$

3. Compute all $\{\gamma_i\}$, then re-estimate $\{\alpha_i\}$ (and $\sigma^2$ if desired)

4. Repeat from 2. until convergence

5. ‘Delete’ weights (and basis functions) for which optimal $\alpha_i = \infty$, since this
implies $\mu_i = 0$

6. Make predictions for new data via the predictive distribution computed with
the converged $\boldsymbol{\alpha}_{MP}$ and $\sigma^2_{MP}$:

$$p(t_* \mid \mathbf{t}) = \int p(t_* \mid \mathbf{w}, \sigma^2_{MP})\, p(\mathbf{w} \mid \mathbf{t}, \boldsymbol{\alpha}_{MP}, \sigma^2_{MP})\, d\mathbf{w}, \tag{38}$$

the mean of which is $y(\mathbf{x}_*; \boldsymbol{\mu})$.

Step 5 rather ideally assumes that we can reliably estimate such large values
of $\alpha$, whereas in reality limited computational precision implies that in this
algorithm we have to place some finite upper limit on $\alpha$ (for example, a large
power of ten times the value of the smallest $\alpha$). In many real-world tasks, we do
indeed find that many $\alpha_i$ do tend towards infinity, and we converge toward a
model that is very sparse, even if M is very large.
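
The six steps above can be sketched in a few lines of code. The following is a
minimal illustration in plain Python and should not be read as the author's
implementation: the function names (`inverse`, `sparse_bayes_fit`), the toy data,
the starting values and the cap `alpha_max = 1e9` are all assumptions made here.
It also departs from the listing in one respect, flagged in the comments:
weights are deleted as soon as their α passes the cap, rather than after
convergence.

```python
import math
import random


def inverse(A):
    """Invert a small square matrix by Gauss-Jordan elimination with partial pivoting."""
    n = len(A)
    aug = [list(row) + [float(i == j) for j in range(n)] for i, row in enumerate(A)]
    for c in range(n):
        p = max(range(c, n), key=lambda r: abs(aug[r][c]))
        aug[c], aug[p] = aug[p], aug[c]
        piv = aug[c][c]
        aug[c] = [v / piv for v in aug[c]]
        for r in range(n):
            if r != c:
                f = aug[r][c]
                aug[r] = [a - f * b for a, b in zip(aug[r], aug[c])]
    return [row[n:] for row in aug]


def sparse_bayes_fit(Phi, t, n_iter=200, alpha_max=1e9):
    """Return (mu, alpha, sigma2); weights that were deleted are reported as 0.0."""
    N, M = len(Phi), len(Phi[0])
    alpha = [1.0] * M                           # step 1: initialise all alpha_i ...
    sigma2 = 0.1 * sum(v * v for v in t) / N    # ... and sigma^2 (arbitrary starting guess)
    active = list(range(M))                     # basis functions still in the model
    mu = []
    for _ in range(n_iter):                     # step 4: repeat (fixed number of sweeps)
        # step 2: Sigma = (Phi^T Phi / sigma^2 + A)^-1, mu = Sigma Phi^T t / sigma^2
        prec = [[sum(row[i] * row[j] for row in Phi) / sigma2 + (alpha[i] if i == j else 0.0)
                 for j in active] for i in active]
        Sigma = inverse(prec)
        rhs = [sum(row[i] * tn for row, tn in zip(Phi, t)) / sigma2 for i in active]
        mu = [sum(s * b for s, b in zip(s_row, rhs)) for s_row in Sigma]
        # step 3: gamma_i = 1 - alpha_i Sigma_ii (37), clamped to [0, 1] against rounding
        gamma = [min(1.0, max(0.0, 1.0 - alpha[i] * Sigma[k][k])) for k, i in enumerate(active)]
        resid = sum((tn - sum(row[i] * m for i, m in zip(active, mu))) ** 2
                    for row, tn in zip(Phi, t))
        sigma2 = resid / (N - sum(gamma))       # equation (36)
        for k, i in enumerate(active):
            m2 = mu[k] * mu[k]
            alpha[i] = gamma[k] / m2 if gamma[k] > 0.0 and m2 > 0.0 else math.inf  # (35)
        # step 5, applied on the fly (an assumption of this sketch): delete capped weights
        keep = [k for k, i in enumerate(active) if alpha[i] < alpha_max]
        mu = [mu[k] for k in keep]
        active = [active[k] for k in keep]
    full_mu = [0.0] * M
    for i, m in zip(active, mu):
        full_mu[i] = m
    return full_mu, alpha, sigma2


random.seed(0)
xs = [-1.0 + 0.1 * n for n in range(21)]
Phi = [[1.0, x, x * x] for x in xs]                  # hypothetical basis: 1, x, x^2
t = [2.0 * x + random.gauss(0.0, 0.1) for x in xs]   # only the linear term is present
mu, alpha, sigma2 = sparse_bayes_fit(Phi, t)
print("weights:", mu)
print("alphas: ", alpha)
print("estimated noise variance:", sigma2)
```

Whether the constant and quadratic weights are actually deleted on this toy
problem depends on the noise draw; the point of the sketch is only the flow of
steps 1 to 5 and where equations (32), (33) and (35-37) enter.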

4.4 The “Relevance Vector Machine” (RVM) 

To give an example of the potential of the above model, we briefly introduce
here the “Relevance Vector Machine” (RVM), which is simply a specialisation
of a sparse Bayesian model which utilises the same data-dependent kernel basis
as the popular “support vector machine” (SVM):

$$y(\mathbf{x}; \mathbf{w}) = \sum_{n=1}^{N} w_n K(\mathbf{x}, \mathbf{x}_n) + w_0. \tag{39}$$

This model is described, with a number of examples, in much more detail
elsewhere [9]. For now, Figure 9 provides an illustration, on some noise-polluted
synthetic data, of the potential of this Bayesian framework for effectively
combining sparsity with predictive accuracy.
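
To show how (39) fits the design-matrix notation used earlier, the sketch below
builds the matrix whose columns are a constant (for $w_0$) followed by one kernel
function per training input. The Gaussian kernel form, its `width` parameter and
all function names are assumptions chosen for illustration; the text specifies
only that a Gaussian kernel is used in Figure 9.

```python
import math


def gaussian_kernel(x, xn, width=1.0):
    """K(x, x_n) = exp(-(x - x_n)^2 / width^2); `width` is a hypothetical scale."""
    return math.exp(-((x - xn) ** 2) / width ** 2)


def rvm_design_matrix(xs, centres, width=1.0):
    """One row per input: [1, K(x, x_1), ..., K(x, x_N)]; the 1 multiplies w_0."""
    return [[1.0] + [gaussian_kernel(x, c, width) for c in centres] for x in xs]


def rvm_predict(x, centres, weights, width=1.0):
    """Evaluate y(x; w) of Eq. (39) with weights = [w_0, w_1, ..., w_N]."""
    return weights[0] + sum(w * gaussian_kernel(x, c, width)
                            for w, c in zip(weights[1:], centres))


train_x = [0.0, 1.0, 2.0]                    # toy training inputs
Phi = rvm_design_matrix(train_x, train_x)    # N rows, N + 1 columns
print(len(Phi), "x", len(Phi[0]))
print(rvm_predict(0.0, train_x, [0.5, 1.0, 0.0, 0.0]))
```

With N training points this basis has M = N + 1 weights, so sparsity is what
keeps the final predictor compact.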



[Figure 9 panels: Relevance Vector Regression; Support Vector Regression]

Fig. 9. The relevance vector and support vector machines applied to a regression
problem using a Gaussian kernel, which demonstrates some of the advantages of the
Bayesian approach. Of particular note is the sparsity of the final Bayesian model,
which qualitatively appears near-optimal. It is also worth underlining that the
‘nuisance’ parameters C and ε for the SVM had to be found by a separate
cross-validation procedure, whereas the RVM algorithm estimates them
automatically, and arguably quite accurately in the case of the noise variance



5 Summary 

While the tone of the first three sections of this article has been introductory 
and the models considered therein have been quite simplistic, the brief example 
of the ‘sparse Bayesian’ learning procedure given in Section 4 is intended to 
demonstrate that ‘practical’ Bayesian inference procedures have the potential to 
be highly effective in the context of modern machine learning. Readers who find 
this demonstration sufficiently convincing and who are interested specifically in 
the sparse Bayesian model framework can find further information (including 
some implementation code), and details of related approaches, at a web-page 
maintained by the author: http://www.research.microsoft.com/mlp/RVM. In
particular, note that the algorithm for hyperparameter estimation of Section 4.3
was presented here as it has a certain intuitive simplicity, but in fact there is a
much more efficient and practical approach to optimising $\log p(\mathbf{t} \mid \boldsymbol{\alpha}, \sigma^2)$ which is
detailed in [10].

We summarised some of the features, advantages and limitations of the gen- 
eral Bayesian framework earlier in Section 3.9, and so will not repeat them here. 
The reader interested in investigating further and in more depth on this general 
topic may find much helpful further material in the references [1, 5, 11, 12, 13, 14].


Abstract. We give a basic introduction to Gaussian Process regression
models. We focus on understanding the role of the stochastic process 
and how it is used to define a distribution over functions. We present 
the simple equations for incorporating training data and examine how 
to learn the hyperparameters using the marginal likelihood. We explain 
the practical advantages of Gaussian Process and end with conclusions 
and a look at the current trends in GP work. 



Supervised learning in the form of regression (for continuous outputs) and clas- 
sification (for discrete outputs) is an important constituent of statistics and 
machine learning, either for analysis of data sets, or as a subgoal of a more 
complex problem. 

Traditionally parametric¹ models have been used for this purpose. These have
a possible advantage in ease of interpretability, but for complex data sets, simple 
parametric models may lack expressive power, and their more complex counter- 
parts (such as feed forward neural networks) may not be easy to work with 
in practice. The advent of kernel machines, such as Support Vector Machines 
and Gaussian Processes has opened the possibility of flexible models which are 
practical to work with. 

In this short tutorial we present the basic idea on how Gaussian Process 
models can be used to formulate a Bayesian framework for regression. We will 
focus on understanding the stochastic process and how it is used in supervised 
learning. Secondly, we will discuss practical matters regarding the role of hyper- 
parameters in the covariance function, the marginal likelihood and the automatic 
Occam’s razor. For broader introductions to Gaussian processes, consult [1], [2]. 

1 Gaussian Processes 

In this section we define Gaussian Processes and show how they can very nat- 
urally be used to define distributions over functions. In the following section 
we continue to show how this distribution is updated in the light of training 
examples. 



¹ By a parametric model, we here mean a model which during training “absorbs” the
information from the training data into the parameters; after training the data can
be discarded.



O. Bousquet et al. (Eds.): Machine Learning 2003, LNAI 3176, pp. 63–71, 2004.
© Springer-Verlag Berlin Heidelberg 2004

Definition 1. A Gaussian process is a collection of random variables, any finite
number of which have (consistent) joint Gaussian distributions.

A Gaussian process is fully specified by its mean function $m(x)$ and covariance
function $k(x, x')$. This is a natural generalization of the Gaussian distribution
whose mean and covariance is a vector and matrix, respectively. The Gaussian
distribution is over vectors, whereas the Gaussian process is over functions. We
will write:

$$f \sim \mathcal{GP}(m, k), \tag{1}$$

meaning: “the function $f$ is distributed as a GP with mean function $m$ and
covariance function $k$”.

Although the generalization from distribution to process is straightforward,
we will be a bit more explicit about the details, because it may be unfamiliar
to some readers. The individual random variables in a vector from a Gaussian
distribution are indexed by their position in the vector. For the Gaussian process
it is the argument $x$ (of the random function $f(x)$) which plays the role of index
set: for every input $x$ there is an associated random variable $f(x)$, which is the
value of the (stochastic) function $f$ at that location. For reasons of notational
convenience, we will enumerate the $x$ values of interest by the natural numbers,
and use these indexes as if they were the indexes of the process; don’t let yourself
be confused by this: the index to the process is $x_i$, which we have chosen to index
by $i$.

Although working with infinite dimensional objects may seem unwieldy at
first, it turns out that the quantities that we are interested in computing require
only working with finite dimensional objects. In fact, answering questions about
the process reduces to computing with the related distribution. This is the key
to why Gaussian processes are feasible. Let us look at an example. Consider the
Gaussian process given by:

$$f \sim \mathcal{GP}(m, k), \quad \text{where} \quad m(x) = \tfrac{1}{4} x^2, \quad \text{and} \quad k(x, x') = \exp\left(-\tfrac{1}{2}(x - x')^2\right). \tag{2}$$

In order to understand this process we can draw samples from the function
$f$. In order to work only with finite quantities, we request only the value of $f$ at
a distinct finite number $n$ of locations. How do we generate such samples? Given
the $x$-values we can evaluate the vector of means and a covariance matrix using
Eq. (2), which defines a regular Gaussian distribution:

$$\mu_i = m(x_i) = \tfrac{1}{4} x_i^2, \quad i = 1, \ldots, n \quad \text{and} \quad \Sigma_{ij} = k(x_i, x_j) = \exp\left(-\tfrac{1}{2}(x_i - x_j)^2\right), \quad i, j = 1, \ldots, n, \tag{3}$$

where to clarify the distinction between process and distribution we use $m$ and
$k$ for the former and $\boldsymbol{\mu}$ and $\Sigma$ for the latter. We can now generate a random
vector from this distribution. This vector will have as coordinates the function
values $f(x)$ for the corresponding $x$'s:

$$\mathbf{f} \sim \mathcal{N}(\boldsymbol{\mu}, \Sigma). \tag{4}$$

Fig. 1. Function values from three functions drawn at random from a GP as specified
in Eq. (2). The dots are the values generated from Eq. (4), the two other curves have
(less correctly) been drawn by connecting sampled points. The function values suggest
a smooth underlying function; this is in fact a property of GPs with the squared
exponential covariance function. The shaded grey area represents the 95% confidence
intervals



We could now plot the values of $f$ as a function of $x$, see Figure 1. How can
we do this in practice? Below are a few lines of Matlab² used to create the plot:

    xs = (-5:0.2:5)'; ns = size(xs,1); keps = 1e-9;   % inputs, their number, jitter
    m = inline('0.25*x.^2');                          % mean function of Eq. (2)
    % covariance function of Eq. (2); p and q are column vectors
    K = inline('exp(-0.5*(repmat(p'',size(q))-repmat(q,size(p''))).^2)');
    % draw fs ~ N(mu, Sigma) as mu + L*z, where L*L' = Sigma and z ~ N(0, I)
    fs = m(xs) + chol(K(xs,xs)+keps*eye(ns))'*randn(ns,1);
    plot(xs,fs,'.')

In the above example, m and K are the mean and covariance functions; chol is a
function to compute the Cholesky decomposition³ of a matrix.
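
For readers without Matlab, the same draw can be sketched in plain Python. This
is an illustrative translation rather than the code used for Figure 1: the
helper names (`mean_fn`, `cov_fn`, `cholesky`, `sample_gp_prior`) are invented
here, and the Cholesky factor is computed by a textbook loop so that the
example needs only the standard library.

```python
import math
import random


def mean_fn(x):
    """Mean function m(x) = x^2 / 4 from Eq. (2)."""
    return 0.25 * x * x


def cov_fn(x, xp):
    """Covariance function k(x, x') = exp(-(x - x')^2 / 2) from Eq. (2)."""
    return math.exp(-0.5 * (x - xp) ** 2)


def cholesky(A):
    """Return lower-triangular L with L L^T = A (A symmetric positive definite)."""
    n = len(A)
    L = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1):
            s = A[i][j] - sum(L[i][k] * L[j][k] for k in range(j))
            L[i][j] = math.sqrt(s) if i == j else s / L[j][j]
    return L


def sample_gp_prior(xs, rng, jitter=1e-9):
    """Draw f ~ N(mu, Sigma) at the inputs xs as mu + L z, with z standard normal."""
    n = len(xs)
    mu = [mean_fn(x) for x in xs]
    # jitter plays the role of keps: a tiny multiple of the identity for stability
    Sigma = [[cov_fn(xi, xj) + (jitter if i == j else 0.0)
              for j, xj in enumerate(xs)] for i, xi in enumerate(xs)]
    L = cholesky(Sigma)
    z = [rng.gauss(0.0, 1.0) for _ in range(n)]
    return [mu[i] + sum(L[i][k] * z[k] for k in range(i + 1)) for i in range(n)]


xs = [-5.0 + 0.2 * i for i in range(51)]
fs = sample_gp_prior(xs, random.Random(0))
print(fs[:3])
```

Multiplying the lower-triangular factor by standard normal draws gives a vector
with covariance $L L^{\mathrm{T}} = \Sigma$, which is exactly the distribution in Eq. (4).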

This example has illustrated how we move from process to distribution and 
also shown that the Gaussian process defines a distribution over functions. Up 
until now, we have only been concerned with random functions - in the next 
section we will see how to use the GP framework in a very simple way to make 
inferences about functions given some training examples. 



2 Posterior Gaussian Process 

In the previous section we saw how to define distributions over functions using 
GPs. This GP will be used as a prior for Bayesian inference. The prior does not
depend on the training data, but specifies some properties of the functions; for 



² Matlab is a trademark of The MathWorks Inc.

³ We’ve also added a tiny keps multiple of the identity to the covariance matrix
for numerical stability (to bound the eigenvalues numerically away from zero); see
comments around Eq. (8) for an interpretation of this term as a tiny amount of noise.

example, in Figure 1 the function is smooth, and close to a quadratic. The goal
of this section is to derive the simple rules of how to update this prior in the
light of the training data. The goal of the next section is to attempt to learn
about some properties of the prior⁴ in the light of the data.

One of the primary goals of computing the posterior is that it can be used to
make predictions for unseen test cases. Let $\mathbf{f}$ be the known function values of
the training cases, and let $\mathbf{f}_*$ be a set of function values corresponding to the
test set inputs, $X_*$. Again, we write out the joint distribution of everything we
are interested in:

$$\begin{bmatrix} \mathbf{f} \\ \mathbf{f}_* \end{bmatrix} \sim \mathcal{N}\left( \begin{bmatrix} \boldsymbol{\mu} \\ \boldsymbol{\mu}_* \end{bmatrix}, \begin{bmatrix} \Sigma & \Sigma_* \\ \Sigma_*^{\mathrm{T}} & \Sigma_{**} \end{bmatrix} \right), \tag{5}$$

where we’ve introduced the following shorthand: $\mu_i = m(x_i)$, $i = 1, \ldots, n$ for the
training means and analogously for the test means $\boldsymbol{\mu}_*$; for the covariance we
use $\Sigma$ for training set covariances, $\Sigma_*$ for training-test set covariances and $\Sigma_{**}$
for test set covariances. Since we know the values for the training set $\mathbf{f}$ we are
interested in the conditional distribution of $\mathbf{f}_*$ given $\mathbf{f}$ which is expressed as⁵:



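$$\mathbf{f}_* \mid \mathbf{f} \sim \mathcal{N}\left( \boldsymbol{\mu}_* + \Sigma_*^{\mathrm{T}} \Sigma^{-1} (\mathbf{f} - \boldsymbol{\mu}),\; \Sigma_{**} - \Sigma_*^{\mathrm{T}} \Sigma^{-1} \Sigma_* \right). \tag{6}$$

This is the posterior distribution for a specific set of test cases. It is easy to
verify (by inspection) that the corresponding posterior process is:

$$\begin{aligned} f \mid \mathcal{D} &\sim \mathcal{GP}(m_D, k_D), \\ m_D(x) &= m(x) + \Sigma(X, x)^{\mathrm{T}} \Sigma^{-1} (\mathbf{f} - \mathbf{m}), \\ k_D(x, x') &= k(x, x') - \Sigma(X, x)^{\mathrm{T}} \Sigma^{-1} \Sigma(X, x'), \end{aligned} \tag{7}$$

where $\Sigma(X, x)$ is a vector of covariances between every training case and $x$.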
These are the central equations for Gaussian process predictions. Let’s examine 
these equations for the posterior mean and covariance. Notice that the posterior 
variance $k_D(x, x)$ is equal to the prior variance $k(x, x)$ minus a positive term,
which depends on the training inputs; thus the posterior variance is always 
smaller than the prior variance, since the data has given us some additional 
information. 
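
The posterior equations (7) can be checked numerically on a small example. The
sketch below is an illustration under assumptions of our own: it reuses the mean
and covariance functions of Eq. (2), invents five noise-free training points,
and solves the linear systems by plain Gaussian elimination (the names
`solve` and `gp_posterior` are hypothetical).

```python
import math


def mean_fn(x):
    return 0.25 * x * x                      # m(x) from Eq. (2)


def cov_fn(x, xp):
    return math.exp(-0.5 * (x - xp) ** 2)    # k(x, x') from Eq. (2)


def solve(A, b):
    """Solve A z = b for a small square A by Gaussian elimination with pivoting."""
    n = len(A)
    aug = [list(row) + [bi] for row, bi in zip(A, b)]
    for c in range(n):
        p = max(range(c, n), key=lambda r: abs(aug[r][c]))
        aug[c], aug[p] = aug[p], aug[c]
        for r in range(c + 1, n):
            f = aug[r][c] / aug[c][c]
            aug[r] = [u - f * v for u, v in zip(aug[r], aug[c])]
    z = [0.0] * n
    for r in reversed(range(n)):
        z[r] = (aug[r][n] - sum(aug[r][j] * z[j] for j in range(r + 1, n))) / aug[r][r]
    return z


def gp_posterior(X, f, x_star):
    """Posterior mean m_D(x*) and variance k_D(x*, x*) of Eq. (7), noise-free case."""
    Sigma = [[cov_fn(a, b) for b in X] for a in X]        # training covariances
    s = [cov_fn(a, x_star) for a in X]                    # Sigma(X, x*)
    resid = [fi - mean_fn(xi) for fi, xi in zip(f, X)]    # f - m
    mean = mean_fn(x_star) + sum(si * wi for si, wi in zip(s, solve(Sigma, resid)))
    var = cov_fn(x_star, x_star) - sum(si * vi for si, vi in zip(s, solve(Sigma, s)))
    return mean, var


X = [-4.0, -2.0, 0.0, 1.0, 3.0]    # hypothetical training inputs
f = [3.0, 1.5, -0.5, 0.0, 2.0]     # hypothetical noise-free function values
for x_star in (0.0, 0.5, 10.0):
    print(x_star, gp_posterior(X, f, x_star))
```

At a training input the posterior mean reproduces the observed value and the
variance collapses to (numerically) zero; far from the data both revert to the
prior mean and prior variance.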

We need to address one final issue: noise in the training outputs. It is common 
to many applications of regression that there is noise in the observations⁶. The
most common assumption is that of additive i.i.d. Gaussian noise in the outputs. 



⁴ By definition, the prior is independent of the data; here we’ll be using a hierarchical
prior with free parameters, and make inference about the parameters.

⁵ The formula for conditioning a joint Gaussian distribution is:

$$\begin{bmatrix} \mathbf{x} \\ \mathbf{y} \end{bmatrix} \sim \mathcal{N}\left( \begin{bmatrix} \mathbf{a} \\ \mathbf{b} \end{bmatrix}, \begin{bmatrix} A & C \\ C^{\mathrm{T}} & B \end{bmatrix} \right) \implies \mathbf{x} \mid \mathbf{y} \sim \mathcal{N}\left( \mathbf{a} + C B^{-1} (\mathbf{y} - \mathbf{b}),\; A - C B^{-1} C^{\mathrm{T}} \right).$$

⁶ However, it is perhaps interesting that the GP model works also in the noise-free
case; this is in contrast to most parametric methods, since they often cannot model
the data exactly.

Fig. 2. Three functions drawn at random from the posterior, given 20 training data
points, the GP as specified in Eq. (3) and a noise level of $\sigma_n = 0.7$. The shaded area
gives the 95% confidence region. Compare with Figure 1 and note that the uncertainty
goes down close to the observations

In the Gaussian process models, such noise is easily taken into account; the
effect is that every $f(x)$ has an extra covariance with itself only (since the noise
is assumed independent), with a magnitude equal to the noise variance:

$$y(x) = f(x) + \varepsilon, \quad \varepsilon \sim \mathcal{N}(0, \sigma_n^2), \qquad f \sim \mathcal{GP}(m, k), \quad y \sim \mathcal{GP}(m, k + \sigma_n^2 \delta_{ii'}), \tag{8}$$

where $\delta_{ii'} = 1$ iff $i = i'$ is the Kronecker delta. Notice that the indexes to the
Kronecker delta are the identities of the cases, $i$, and not the inputs $x_i$; you may
have several cases with identical inputs, but the noise on these cases is assumed
to be independent. Thus, the covariance function for a noisy process is the sum
of the signal covariance and the noise covariance.
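
Combining Eq. (8) with the conditioning result (6) gives the prediction from
noisy targets. As a sketch, assuming the shorthand of Eq. (5), and writing $\mathbf{y}$ for
the vector of noisy training outputs and $I$ for the $n \times n$ identity (notation
introduced here for illustration):

```latex
\mathbf{f}_* \mid \mathbf{y} \sim \mathcal{N}\!\left(
  \boldsymbol{\mu}_* + \Sigma_*^{\top} \left( \Sigma + \sigma_n^2 I \right)^{-1}
  (\mathbf{y} - \boldsymbol{\mu}),\;
  \Sigma_{**} - \Sigma_*^{\top} \left( \Sigma + \sigma_n^2 I \right)^{-1} \Sigma_*
\right)
```

Only the training covariance picks up the noise term, because the test
quantities $\mathbf{f}_*$ are noise-free function values. Setting $\sigma_n = 0$ recovers Eq. (6).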

Now, we can plug the posterior covariance function into the little Matlab
example from Section 1 to draw samples from the posterior process, see Figure 2. In
this section we have shown how simple manipulations with mean and covariance 
functions allow updates of the prior to the posterior in the light of the training 
data. However, we left some questions unanswered: How do we come up with 
mean and covariance functions in the first place? How could we estimate the 
noise level? This is the topic of the next section. 



3 Training a Gaussian Process

In the previous section we saw how to update the prior Gaussian process in the 
light of training data. This is useful if we have enough prior information about 
a dataset at hand to confidently specify prior mean and covariance functions. 
However, the availability of such detailed prior information is not the typical case 
in machine learning applications. In order for the GP techniques to be of value 
in practice, we must be able to choose between different mean and covariance
functions in the light of the data. This process will be referred to as training⁷
the GP model.

In the light of typically vague prior information, we use a hierarchical prior,
where the mean and covariance functions are parameterized in terms of
hyperparameters. For example, we could use a generalization of Eq. (2):

$$f \sim \mathcal{GP}(m, k), \quad m(x) = a x^2 + b x + c, \quad \text{and} \quad k(x, x') = \sigma_y^2 \exp\left( -\frac{(x - x')^2}{2\ell^2} \right) + \sigma_n^2 \delta_{ii'}, \tag{9}$$

where we have introduced hyperparameters $\theta = \{a, b, c, \sigma_y, \sigma_n, \ell\}$. The purpose
of this hierarchical specification is that it allows us to specify vague prior
information in a simple way. For example, we’ve stated that we believe the function
to be close to a second order polynomial, but we haven’t said exactly what
the polynomial is, or exactly what is meant by “close”. In fact the discrepancy
between the polynomial and the data is a smooth function plus independent
Gaussian noise, but again we don’t need to specify exactly the characteristic
length scale $\ell$ or the magnitudes of the two contributions. We want to be able
to make inferences about all of the hyperparameters in the light of the data.

In order to do this we compute the probability of the data given the
hyperparameters. Fortunately, this is not difficult, since by assumption the
distribution of the data is Gaussian:

$$L = \log p(\mathbf{y} \mid \mathbf{x}, \theta) = -\tfrac{1}{2} \log |\Sigma| - \tfrac{1}{2} (\mathbf{y} - \boldsymbol{\mu})^{\mathrm{T}} \Sigma^{-1} (\mathbf{y} - \boldsymbol{\mu}) - \tfrac{n}{2} \log(2\pi). \tag{10}$$

We will call this quantity the log marginal likelihood. We use the term
“marginal” to emphasize that we are dealing with a non-parametric model. See
e.g. [1] for the weight-space view of Gaussian processes which equivalently leads
to Eq. (10) after marginalization over the weights.
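
Eq. (10) is straightforward to evaluate once a Cholesky factor of $\Sigma$ is
available. The following is a minimal sketch in plain Python, assuming the mean
and covariance functions of Eq. (9); the function names and the toy data are
invented for illustration. It returns the three terms of Eq. (10) separately so
that their individual sizes can be inspected.

```python
import math


def cholesky(A):
    """Return lower-triangular L with L L^T = A (A symmetric positive definite)."""
    n = len(A)
    L = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1):
            s = A[i][j] - sum(L[i][k] * L[j][k] for k in range(j))
            L[i][j] = math.sqrt(s) if i == j else s / L[j][j]
    return L


def log_marginal_terms(x, y, a, b, c, sigma_y, sigma_n, ell):
    """Return the three terms of Eq. (10) for the hierarchical model of Eq. (9)."""
    n = len(x)
    mu = [a * xi * xi + b * xi + c for xi in x]
    Sigma = [[sigma_y ** 2 * math.exp(-(xi - xj) ** 2 / (2.0 * ell ** 2))
              + (sigma_n ** 2 if i == j else 0.0)      # noise enters on the diagonal only
              for j, xj in enumerate(x)] for i, xi in enumerate(x)]
    L = cholesky(Sigma)
    z = []                                             # forward substitution: L z = y - mu
    for i in range(n):
        z.append((y[i] - mu[i] - sum(L[i][k] * z[k] for k in range(i))) / L[i][i])
    first = -sum(math.log(L[i][i]) for i in range(n))  # -1/2 log|Sigma|
    second = -0.5 * sum(v * v for v in z)              # -1/2 (y-mu)^T Sigma^-1 (y-mu)
    third = -0.5 * n * math.log(2.0 * math.pi)         # -n/2 log(2 pi)
    return first, second, third


x = [-2.0, -1.0, 0.0, 1.0, 2.0]        # hypothetical toy inputs
y = [1.1, 0.2, -0.6, -0.1, 0.9]        # hypothetical toy targets
for ell in (0.3, 1.0, 3.0):
    terms = log_marginal_terms(x, y, 0.25, 0.0, 0.0, 1.0, 0.1, ell)
    print(f"ell={ell}: terms={terms}, L={sum(terms)}")
```

Working with the Cholesky factor avoids forming $\Sigma^{-1}$ explicitly: the
log-determinant is twice the sum of the logs of the diagonal of $L$, and the
quadratic form is the squared length of the solution of $L\mathbf{z} = \mathbf{y} - \boldsymbol{\mu}$.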

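We can now find the values of the hyperparameters which optimize the
marginal likelihood based on its partial derivatives which are easily evaluated:

$$\begin{aligned} \frac{\partial L}{\partial \theta_m} &= (\mathbf{y} - \boldsymbol{\mu})^{\mathrm{T}} \Sigma^{-1} \frac{\partial \mathbf{m}}{\partial \theta_m}, \\ \frac{\partial L}{\partial \theta_k} &= -\tfrac{1}{2} \operatorname{trace}\left( \Sigma^{-1} \frac{\partial \Sigma}{\partial \theta_k} \right) + \tfrac{1}{2} (\mathbf{y} - \boldsymbol{\mu})^{\mathrm{T}} \Sigma^{-1} \frac{\partial \Sigma}{\partial \theta_k} \Sigma^{-1} (\mathbf{y} - \boldsymbol{\mu}), \end{aligned} \tag{11}$$

where $\theta_m$ and $\theta_k$ are used to indicate hyperparameters of the mean and
covariance functions respectively. Eq. (11) can conveniently be used in conjunction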


⁷ Training the GP model involves both model selection, or the discrete choice between
different functional forms for mean and covariance functions as well as adaptation
of the hyperparameters of these functions; for brevity we will only consider the
latter here - the generalization is straightforward, in that marginal likelihoods can
be compared.

Fig. 3. Mean and 95% posterior confidence region with parameters learned by
maximizing marginal likelihood, Eq. (10), for the Gaussian process specification in Eq. (9),
for the same data as in Figure 2. The hyperparameters found were $a = 0.3$, $b = 0.03$,
$c = -0.7$, $\ell = 0.7$, $\sigma_y = 1.1$, $\sigma_n = 0.25$. This example was constructed so that the approach
without optimization of hyperparameters worked reasonably well (Figure 2), but there
is of course no guarantee of this in a typical application

with a numerical optimization routine such as conjugate gradients to find good⁸
hyperparameter settings.

Due to the fact that the Gaussian process is a non-parametric model, the
marginal likelihood behaves somewhat differently to what one might expect from
experience with parametric models. Note first, that it is in fact very easy for the
model to fit the training data exactly: simply set the noise level $\sigma_n^2$ to zero, and
the model produces a mean predictive function which agrees exactly with the
training points. However, this is not the typical behavior when optimizing the
marginal likelihood. Indeed, the log marginal likelihood from Eq. (10) consists
of three terms: The first term, $-\tfrac{1}{2} \log |\Sigma|$, is a complexity penalty term, which
measures and penalizes the complexity of the model. The second term is a
negative quadratic, and plays the role of a data-fit measure (it is the only term which
depends on the training set output values $\mathbf{y}$). The third term is a log
normalization term, independent of the data, and not very interesting. Figure 3 illustrates
the predictions of a model trained by maximizing the marginal likelihood.

Note that the tradeoff between penalty and data-fit in the GP model is auto- 
matic. There is no weighting parameter which needs to be set by some external 
method such as cross validation. This is a feature of great practical importance, 
since it simplifies training. Figure 4 illustrates how the automatic tradeoff comes 
about. 

We’ve seen in this section how we, via a hierarchical specification of the prior, 
can express prior knowledge in a convenient way, and how we can learn values 
of hyperparameters via optimization of the marginal likelihood. This can be 
done using some gradient based optimization. Also, we’ve seen how the marginal 



⁸ Note, that for most non-trivial Gaussian processes, optimization over
hyperparameters is not a convex problem, so the usual precautions against bad local
minima should be taken.

Fig. 4. Occam’s razor is automatic. On the x-axis is an abstract representation of all
possible datasets (of a particular size). On the y-axis the probability of the data given
the model. Three different models are shown. A more complex model can account for
many more data sets than a simple model, but since the probabilities have to integrate
to unity, this means more complex models are automatically penalized more

likelihood automatically incorporates Occam’s razor; this property is of great
practical importance, since it simplifies training a lot.



4 Conclusions and Future Directions 

We’ve seen how Gaussian processes can conveniently be used to specify very flex- 
ible non-linear regression. We only mentioned in passing one type of covariance 
function, but in fact any positive definite function⁹ can be used as covariance
function. Many such functions are known, and understanding the properties of 
functions drawn from GPs with particular covariance functions is an impor- 
tant ongoing research goal. When the properties of these functions are known, 
one will be able to choose covariance functions reflecting prior information, or
alternatively, one will be able to interpret the covariance functions chosen by 
maximizing marginal likelihood, to get a better understanding of the data. 

In this short tutorial, we have only treated the simplest possible case of 
regression with Gaussian noise. In the case of non-Gaussian likelihoods (such as 
e.g. needed for classification) training becomes more complicated. One can resort 
to approximations, such as the Laplace approximation [3], or approximations 
based on projecting the non-Gaussian posterior onto the closest Gaussian (in a 
KL sense) [4] or sampling techniques [5]. 



⁹ The covariance function must be positive definite to ensure that the resulting
covariance matrix is positive definite.

Another issue is the computational limitations. A straightforward
implementation of the simple techniques explained here requires inversion of the
covariance matrix $\Sigma$, with a memory complexity of $O(n^2)$ and a computational
complexity of $O(n^3)$. This is feasible on a desktop computer for dataset sizes of $n$
up to a few thousand. Although there are many interesting machine learning
problems with such relatively small datasets, a lot of current work is going into 
the development of approximate methods for larger datasets. A number of these 
methods rely on sparse approximations. 
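
A quick back-of-the-envelope calculation shows how these scalings bite. The
sketch below rests on assumptions of our own rather than figures from the
text: the covariance matrix is stored densely in 8-byte floating point, and
the work is taken to be the leading-order $n^3/3$ operations of a Cholesky
factorisation. The function name is hypothetical.

```python
def gp_cost(n, bytes_per_float=8):
    """Rough storage (bytes) and Cholesky work (flops) for an n x n covariance matrix."""
    memory_bytes = n * n * bytes_per_float   # O(n^2): one dense matrix
    flops = n ** 3 / 3.0                     # O(n^3): leading-order factorisation cost
    return memory_bytes, flops


for n in (1000, 5000, 20000):
    mem, flops = gp_cost(n)
    print(f"n={n:6d}  memory={mem / 2**30:7.2f} GiB  flops={flops:.2e}")
```

Doubling the number of training cases quadruples the storage and multiplies
the work by eight, which is why approximate methods become attractive well
before $n$ reaches the tens of thousands.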


Abstract. We give a tutorial and overview of the field of unsupervised
learning from the perspective of statistical modeling. Unsupervised learning
can be motivated from information theoretic and Bayesian principles.
We briefly review basic models in unsupervised learning, including factor
analysis, PCA, mixtures of Gaussians, ICA, hidden Markov models,
state-space models, and many variants and extensions. We derive the
EM algorithm and give an overview of fundamental concepts in graphical
models, and inference algorithms on graphs. This is followed by a
quick tour of approximate Bayesian inference, including Markov chain
Monte Carlo (MCMC), Laplace approximation, BIC, variational
approximations, and expectation propagation (EP). The aim of this chapter is
to provide a high-level view of the field. Along the way, many
state-of-the-art ideas and future directions are also reviewed.



1 Introduction 

Machine learning is the field of research devoted to the formal study of learn- 
ing systems. This is a highly interdisciplinary field which borrows and builds 
upon ideas from statistics, computer science, engineering, cognitive science, op- 
timization theory and many other disciplines of science and mathematics. The 
purpose of this chapter is to introduce in a fairly concise manner the key ideas 
underlying the sub-field of machine learning known as unsupervised learning.

This introduction is necessarily incomplete given the enormous range of topics 
under the rubric of unsupervised learning. The hope is that interested readers 
can delve more deeply into the many topics covered here by following some of 
the cited references. The chapter starts at a highly tutorial level but will touch 
upon state-of-the-art research in later sections. It is assumed that the reader is 
familiar with elementary linear algebra, probability theory, and calculus, but not 
much else. 

1.1 What Is Unsupervised Learning? 

Consider a machine (or living organism) which receives some sequence of inputs 
$x_1, x_2, x_3, \ldots$, where $x_t$ is the sensory input at time $t$. This input, which we will
often call the data, could correspond to an image on the retina, the pixels in a
camera, or a sound waveform. It could also correspond to less obviously sensory 
data, for example the words in a news story, or the list of items in a supermarket 
shopping basket. 

One can distinguish between four different kinds of machine learning. In
supervised learning the machine¹ is also given a sequence of desired outputs
$y_1, y_2, \ldots$, and the goal of the machine is to learn to produce the correct output
given a new input. This output could be a class label (in classification) or a real 
number (in regression). 

In reinforcement learning the machine interacts with its environment by
producing actions $a_1, a_2, \ldots$. These actions affect the state of the environment, which
in turn results in the machine receiving some scalar rewards (or punishments)
$r_1, r_2, \ldots$. The goal of the machine is to learn to act in a way that maximizes the
future rewards it receives (or minimizes the punishments) over its lifetime. Rein- 
forcement learning is closely related to the fields of decision theory (in statistics 
and management science), and control theory (in engineering). The fundamental 
problems studied in these fields are often formally equivalent, and the solutions 
are the same, although different aspects of problem and solution are usually 
emphasized. 

A third kind of machine learning is closely related to game theory and
generalizes reinforcement learning. Here again the machine gets inputs, produces
actions, and receives rewards. However, the environment the machine interacts 
with is not some static world, but rather it can contain other machines which 
can also sense, act, receive rewards, and learn. Thus the goal of the machine is 
to act so as to maximize rewards in light of the other machines’ current and 
future actions. Although there is a great deal of work in game theory for simple 
systems, the dynamic case with multiple adapting machines remains an active 
and challenging area of research. 

Finally, in unsupervised learning the machine simply receives inputs $x_1, x_2, \ldots$,
but obtains neither supervised target outputs, nor rewards from its environment.
It may seem somewhat mysterious to imagine what the machine could possibly 
learn given that it doesn’t get any feedback from its environment. However, it 
is possible to develop a formal framework for unsupervised learning based on
the notion that the machine’s goal is to build representations of the input that 
can be used for decision making, predicting future inputs, efficiently communi- 
cating the inputs to another machine, etc. In a sense, unsupervised learning can 
be thought of as finding patterns in the data above and beyond what would be 
considered pure unstructured noise. Two very simple classic examples of unsu- 
pervised learning are clustering and dimensionality reduction. We discuss these 
in Section 2. The remainder of this chapter focuses on unsupervised learning,



¹ Henceforth, for succinctness I’ll use the term machine to refer both to machines
and living organisms. Some people prefer to call this a system or agent. The same 
mathematical theory of learning applies regardless of what we choose to call the 
learner, whether it is artificial or biological. 
although many of the concepts discussed can be applied to supervised learning
as well. But first, let us consider how unsupervised learning relates to statistics 
and information theory. 



1.2 Machine Learning, Statistics, and Information Theory 

Almost all work in unsupervised learning can be viewed in terms of learning 
a probabilistic model of the data. Even when the machine is given no super- 
vision or reward, it may make sense for the machine to estimate a model that 
represents the probability distribution for a new input $x_t$ given previous
inputs $x_1, \ldots, x_{t-1}$ (consider the obviously useful examples of stock prices, or the
weather). That is, the learner models $P(x_t | x_1, \ldots, x_{t-1})$. In simpler cases where
the order in which the inputs arrive is irrelevant or unknown, the machine can
build a model of the data which assumes that the data points $x_1, x_2, \ldots$ are
independently and identically drawn from some distribution $P(x)$².

Such a model can be used for outlier detection or monitoring. Let $x$ represent
patterns of sensor readings from a nuclear power plant and assume that $P(x)$
is learned from data collected from a normally functioning plant. This model
can be used to evaluate the probability of a new sensor reading; if this
probability is abnormally low, then either the model is poor or the plant is behaving
abnormally, in which case one may want to shut it down.
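
The monitoring idea above can be sketched in a few lines. This is a minimal illustration, not a recommended monitoring system: it assumes a univariate Gaussian for $P(x)$, uses made-up temperature readings, and adopts an arbitrary rule (flag anything less probable than the least probable training reading). The names `fit_gaussian`, `log_density`, and `is_outlier` are hypothetical.

```python
import math

def fit_gaussian(xs):
    """Fit P(x) as a univariate Gaussian by maximum likelihood."""
    mean = sum(xs) / len(xs)
    var = sum((x - mean) ** 2 for x in xs) / len(xs)
    return mean, var

def log_density(x, mean, var):
    """Log of the Gaussian density at x."""
    return -0.5 * math.log(2.0 * math.pi * var) - 0.5 * (x - mean) ** 2 / var

# Hypothetical readings collected while the plant was functioning normally.
normal_readings = [300.1, 299.8, 300.4, 299.9, 300.0, 300.3, 299.7, 300.2]
mean, var = fit_gaussian(normal_readings)

# Assumed decision rule: the threshold is the lowest log density seen in training.
threshold = min(log_density(x, mean, var) for x in normal_readings)

def is_outlier(x):
    """True if the model assigns x an abnormally low probability density."""
    return log_density(x, mean, var) < threshold

print(is_outlier(300.0), is_outlier(305.0))
```

A low density only says that the reading is surprising under the model; as the text notes, the cause may be a poor model rather than an abnormal plant.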

A probabilistic model can also be used for classification. Assume $P_1(x)$ is a
model of the attributes of credit card holders who paid on time, and $P_2(x)$ is
a model learned from credit card holders who defaulted on their payments. By
evaluating the relative probabilities $P_1(x')$ and $P_2(x')$ on a new applicant $x'$, the
machine can decide to classify her into one of these two categories.

With a probabilistic model one can also achieve efficient communication and
data compression. Imagine that we want to transmit, over a digital communication
line, symbols $x$ randomly drawn from $P(x)$. For example, $x$ may be letters of
the alphabet, or images, and the communication line may be the Internet.
Intuitively, we should encode our data so that symbols which occur more frequently
have code words with fewer bits in them, otherwise we are wasting bandwidth.
Shannon’s source coding theorem quantifies this by telling us that the optimal
number of bits to use to encode a symbol with probability $P(x)$ is $-\log_2 P(x)$.
Using this number of bits for each symbol, the expected coding cost is the
entropy of the distribution $P$:

$$-\sum_x P(x) \log_2 P(x) \tag{1}$$

In general, the true distribution of the data is unknown, but we can learn
a model of this distribution. Let’s call this model $Q(x)$. The optimal code with
respect to this model



² We will use both $P$ and $p$ to denote probability distributions and probability
densities. The meaning should be clear depending on whether the argument is discrete
or continuous. 
would use $-\log_2 Q(x)$ bits for each symbol $x$. The expected
coding cost, taking expectations with respect to the true distribution, is

$$-\sum_x P(x) \log_2 Q(x) \tag{2}$$

The difference between these two coding costs is called the Kullback-Leibler
(KL) divergence

$$\mathrm{KL}(P\|Q) \stackrel{\mathrm{def}}{=} \sum_x P(x) \log_2 \frac{P(x)}{Q(x)} \tag{3}$$

The KL divergence is non-negative and zero if and only if $P = Q$. It measures
the coding inefficiency in bits from using a model $Q$ to compress data when the
true data distribution is $P$. This is an important link
between machine learning, statistics, and information theory. An excellent text
which elaborates on these relationships and many of the topics in this chapter
is [1].
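
As a small numerical check of Equations (1) to (3), the sketch below computes the entropy, the expected coding cost under a mismatched model, and their difference. The four-symbol source `P` and the uniform model `Q` are made-up values chosen so that all code lengths are whole bits.

```python
import math

def entropy(p):
    """Expected code length in bits under the optimal code for p (Eq. 1)."""
    return -sum(px * math.log2(px) for px in p if px > 0)

def cross_entropy(p, q):
    """Expected code length in bits when data from p is coded using model q (Eq. 2)."""
    return -sum(px * math.log2(qx) for px, qx in zip(p, q) if px > 0)

def kl_divergence(p, q):
    """Extra bits per symbol paid for using q instead of p (Eq. 3)."""
    return sum(px * math.log2(px / qx) for px, qx in zip(p, q) if px > 0)

# Hypothetical source distribution and a uniform model of it.
P = [0.5, 0.25, 0.125, 0.125]
Q = [0.25, 0.25, 0.25, 0.25]

print(entropy(P))           # 1.75
print(cross_entropy(P, Q))  # 2.0
print(kl_divergence(P, Q))  # 0.25
```

Here the uniform model wastes a quarter of a bit per symbol, which is exactly the gap between the two coding costs.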



1.3 Bayes Rule 



Bayes rule,

$$P(y|x) = \frac{P(x|y) P(y)}{P(x)} \tag{4}$$

which follows from the equality $P(x, y) = P(x) P(y|x) = P(y) P(x|y)$, can be
used to motivate a coherent statistical framework for machine learning. The 
basic idea is the following. Imagine we wish to design a machine which has 
beliefs about the world, and updates these beliefs on the basis of observed data. 
The machine must somehow represent the strengths of its beliefs numerically. It 
has been shown that if you accept certain axioms of coherent inference, known
as the Cox axioms, then a remarkable result follows [2]: If the machine is to
represent the strength of its beliefs by real numbers, then the only reasonable
and coherent way of manipulating these beliefs is to have them satisfy the rules
of probability, such as Bayes rule. Therefore, $P(X = x)$ can be used not only to
represent the frequency with which the variable $X$ takes on the value $x$ (as in
so-called frequentist statistics) but it can also be used to represent the degree
of belief that $X = x$. Similarly, $P(X = x | Y = y)$ can be used to represent the
degree of belief that $X = x$ given that one knows $Y = y$.³



³ Another way to motivate the use of the rules of probability to encode degrees of belief
comes from game-theoretic arguments in the form of the Dutch Book Theorem. This
theorem states that if you are willing to accept bets with odds based on your degrees
of belief, then unless your beliefs are coherent in the sense that they satisfy the rules
of probability theory, there exists a set of simultaneous bets (called a “Dutch Book”)
which you will accept and which is guaranteed to lose you money, no matter what 
the outcome. The only way to ensure that Dutch Books don’t exist against you, is 
to have degrees of belief that satisfy Bayes rule and the other rules of probability 
theory. 
From Bayes rule we derive the following simple framework for machine
learning. Assume a universe of models $\Omega$; let $\Omega = \{1, \ldots, M\}$ although it need not be
finite or even countable. The machine starts with some prior beliefs over models
$m \in \Omega$ (we will see many examples of models later), such that $\sum_{m=1}^{M} P(m) = 1$.

A model is simply some probability distribution over data points, i.e. $P(x|m)$.
For simplicity, let us further assume that in all the models the data is taken to
be independently and identically distributed (i.i.d.). After observing a data set
$\mathcal{D} = \{x_1, \ldots, x_N\}$, the beliefs over models are given by:

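$$P(m|\mathcal{D}) = \frac{P(m) P(\mathcal{D}|m)}{P(\mathcal{D})} \propto P(m) \prod_{n=1}^{N} P(x_n|m) \tag{5}$$

which we read as: the posterior over models is the prior multiplied by the
likelihood, normalized. The predictive distribution over new data, which would
be used to encode new data efficiently, is

$$P(x|\mathcal{D}) = \sum_{m=1}^{M} P(x|m) P(m|\mathcal{D}) \tag{6}$$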

Again this follows from the rules of probability theory, and the fact that the 
models are assumed to produce i.i.d. data. 
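
A toy instance of Equations (5) and (6) may make the framework concrete. The sketch assumes a universe of three hypothetical coin models, each fixing the probability of heads, with a uniform prior; the model names, the data, and the function names are all illustrative.

```python
# Three hypothetical models of a coin; each fixes P(heads).
models = {"fair": 0.5, "heads-biased": 0.8, "tails-biased": 0.2}
prior = {m: 1.0 / len(models) for m in models}

def likelihood(x, p_heads):
    """P(x | m) for a single toss x in {0, 1}."""
    return p_heads if x == 1 else 1.0 - p_heads

def posterior(data):
    """Eq. (5): prior times the i.i.d. likelihood, normalized over models."""
    unnorm = {}
    for m, p_heads in models.items():
        joint = prior[m]
        for x in data:
            joint *= likelihood(x, p_heads)
        unnorm[m] = joint
    evidence = sum(unnorm.values())
    return {m: u / evidence for m, u in unnorm.items()}

def predictive(x, post):
    """Eq. (6): each model's prediction, weighted by its posterior probability."""
    return sum(likelihood(x, models[m]) * post[m] for m in models)

data = [1, 1, 0, 1, 1, 1]  # five heads, one tail
post = posterior(data)
print(post)
print(predictive(1, post))
```

The prediction for the next toss is not taken from the single best model; every model contributes in proportion to how well it explained the data.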

Often models are defined by writing down a parametric probability distri- 
bution (again, we’ll see many examples below). Thus, the model m might have 
parameters $\theta$, which are assumed to be unknown (this could in general be a
vector of parameters). To be a well-defined model from the perspective of Bayesian
learning, one has to define a prior over these model parameters $P(\theta|m)$ which
naturally has to satisfy the following equality

$$P(x|m) = \int P(x|\theta, m) P(\theta|m) \, d\theta \tag{7}$$

Given the model $m$ it is also possible to infer the posterior over the
parameters of the model, i.e. $P(\theta|\mathcal{D}, m)$, and to compute the predictive
distribution, $P(x|\mathcal{D}, m)$. These quantities are derived in exact analogy to equations (5)
and (6), except that instead of summing over possible models, we integrate over 
parameters of a particular model. All the key quantities in Bayesian machine 
learning follow directly from the basic rules of probability theory. 

Certain approximate forms of Bayesian learning are worth mentioning. Let’s 
focus on a particular model $m$ with parameters $\theta$, and an observed data set $\mathcal{D}$.
The predictive distribution averages over all possible parameters weighted by
the posterior

$$P(x|\mathcal{D}, m) = \int P(x|\theta) P(\theta|\mathcal{D}, m) \, d\theta. \tag{8}$$

In certain cases, it may be cumbersome to represent the entire posterior 
distribution over parameters, so instead we will choose to find a point estimate
of the parameters $\theta$. A natural choice is to pick the most probable parameter
value given the data, which is known as the maximum a posteriori or MAP
parameter estimate



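$$\hat{\theta}_{\mathrm{MAP}} = \arg\max_{\theta} P(\theta|\mathcal{D}, m) = \arg\max_{\theta} \left[ \log P(\theta|m) + \sum_n \log P(x_n|\theta, m) \right] \tag{9}$$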



Another natural choice is the maximum likelihood or ML parameter estimate

$$\hat{\theta}_{\mathrm{ML}} = \arg\max_{\theta} P(\mathcal{D}|\theta, m) = \arg\max_{\theta} \sum_n \log P(x_n|\theta, m) \tag{10}$$

Many learning algorithms can be seen as finding ML parameter estimates. 
The ML parameter estimate is also acceptable from a frequentist statistical mod- 
eling perspective since it does not require deciding on a prior over parameters. 
However, ML estimation does not protect against overfitting — more complex 
models will generally have higher maxima of the likelihood. In order to avoid 
problems with overfitting, frequentist procedures often maximize a penalized
or regularized log likelihood (e.g. [3]). If the penalty or regularization term is
interpreted as a log prior, then maximizing penalized likelihood appears
identical to maximizing a posterior. However, there are subtle issues that make a
Bayesian MAP procedure and maximum penalized likelihood different [4]. One 
difference is that the MAP estimate is not invariant to reparameterization, while 
the maximum of the penalized likelihood is invariant. The penalized likelihood is 
a function, not a density, and therefore does not increase or decrease depending 
on the Jacobian of the reparameterization. 
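
To illustrate the difference between Equations (9) and (10), the sketch below uses a Bernoulli model of coin tosses. The Beta(2, 2) prior is an assumption made for the example, chosen because the posterior mode then has a closed form; the function names are hypothetical.

```python
def ml_estimate(heads, n):
    """Eq. (10) for a Bernoulli model: the fraction of heads maximizes the likelihood."""
    return heads / n

def map_estimate(heads, n, a=2.0, b=2.0):
    """Eq. (9) with an assumed Beta(a, b) prior on theta (valid for a, b > 1).

    The log prior (a-1) log(theta) + (b-1) log(1-theta) acts like a penalty
    that pulls the estimate away from 0 and 1.
    """
    return (heads + a - 1.0) / (n + a + b - 2.0)

# Three tosses, all heads: ML concludes that tails can never occur.
print(ml_estimate(3, 3))   # 1.0
print(map_estimate(3, 3))  # 0.8
```

With only three observations the ML estimate overfits, while the prior keeps the MAP estimate away from the boundary; as the number of tosses grows, the two estimates approach each other.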



2 Latent Variable Models 

The framework described above can be applied to a wide range of models. No 
single model is appropriate for all data sets. The art in machine learning is to
develop models which are appropriate for the data set being analyzed, and which 
have certain desired properties. For example, for high dimensional data sets it 
might be necessary to use models that perform dimensionality reduction. Of 
course, ultimately, the machine should be able to decide on the appropriate 
model without any human intervention, but to achieve this in full generality 
requires significant advances in artificial intelligence. 

In this section, we will consider probabilistic models that are defined in terms 
of some latent or hidden variables. These models can be used to do dimensionality 
reduction and clustering, the two cornerstones of unsupervised learning. 

2.1 Factor Analysis 

Let the data set $\mathcal{D}$ consist of $D$-dimensional real-valued vectors, $\mathcal{D} = \{y_1, \ldots, y_N\}$.
In factor analysis, the data is assumed to be generated from the following model

$$y = \Lambda x + \epsilon \tag{11}$$
where $x$ is a $K$-dimensional zero-mean unit-variance multivariate Gaussian
vector with elements corresponding to hidden (or latent) factors, $\Lambda$ is a $D \times K$
matrix of parameters, known as the factor loading matrix, and $\epsilon$ is a $D$-dimensional
zero-mean multivariate Gaussian noise vector with diagonal covariance matrix
$\Psi$. Defining the parameters of the model to be $\theta = (\Psi, \Lambda)$, by integrating out the
factors, one can readily derive that

$$p(y|\theta) = \int p(x|\theta) \, p(y|x, \theta) \, dx = \mathcal{N}(0, \Lambda\Lambda^\top + \Psi) \tag{12}$$

where $\mathcal{N}(\mu, \Sigma)$ refers to a multivariate Gaussian density with mean $\mu$ and
covariance matrix $\Sigma$. For more details refer to [5].

Factor analysis is an interesting model for several reasons. If the data is very 
high dimensional (D is large) then even a simple model like the full-covariance 
multivariate Gaussian will have too many parameters to reliably estimate or 
infer from the data. By choosing K < D, factor analysis makes it possible to 
model a Gaussian density for high dimensional data without requiring $O(D^2)$
parameters. Moreover, given a new data point, one can compute the posterior
over the hidden factors, $p(x|y, \theta)$; since $x$ is lower dimensional than $y$ this
provides a low-dimensional representation of the data (for example, one could pick
the mean of $p(x|y, \theta)$ as the representation for $y$).



2.2 Principal Components Analysis (PCA) 

Principal components analysis (PCA) is an important limiting case of factor
analysis (FA). One can derive PCA by making two modifications to FA. First,
the noise is assumed to be isotropic, in other words each element of $\epsilon$ has equal
variance: $\Psi = \sigma^2 I$, where $I$ is a $D \times D$ identity matrix. This model is called
probabilistic PCA [6, 7]. Second, if we take the limit of $\sigma \to 0$ in probabilistic PCA,
we obtain standard PCA (which also goes by the names Karhunen-Loeve
expansion, and singular value decomposition; SVD). Given a data set with covariance
matrix $S$, for maximum likelihood factor analysis the goal is to find parameters
$\Lambda$ and $\Psi$ for which the model $\Lambda\Lambda^\top + \Psi$ has highest likelihood. In PCA, the goal
is to find $\Lambda$ so that the likelihood is highest for $\Lambda\Lambda^\top$. Note that this matrix is
singular unless $K = D$, so the standard PCA model is not a sensible model.
However, taking the limiting case, and further constraining the columns of $\Lambda$ to
be orthogonal, it can be derived that the principal components correspond to
the $K$ eigenvectors with largest eigenvalue of $S$. PCA is thus attractive because
the solution can be found immediately after eigendecomposition of the
covariance. Taking the limit $\sigma \to 0$ of $p(x|y, \Lambda, \sigma)$ we find that it is a delta-function
at $x = \Lambda^\top y$, which is the projection of $y$ onto the principal components.
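
The eigenvector characterization of PCA can be illustrated without any linear algebra library. The sketch below is an assumption-laden toy: it uses made-up 2-D data, centers the data before forming $S$, and finds only the first principal component by power iteration (a simple stand-in for a full eigendecomposition). The function names are hypothetical.

```python
import math

def covariance(data):
    """Sample covariance matrix S and mean of a list of equal-length vectors."""
    n, d = len(data), len(data[0])
    mean = [sum(row[j] for row in data) / n for j in range(d)]
    centered = [[row[j] - mean[j] for j in range(d)] for row in data]
    S = [[sum(r[i] * r[j] for r in centered) / n for j in range(d)]
         for i in range(d)]
    return S, mean

def top_eigenvector(S, iters=200):
    """Power iteration: converges to the eigenvector of S with largest eigenvalue."""
    d = len(S)
    v = [1.0 / math.sqrt(d)] * d
    for _ in range(iters):
        w = [sum(S[i][j] * v[j] for j in range(d)) for i in range(d)]
        norm = math.sqrt(sum(x * x for x in w))
        v = [x / norm for x in w]
    return v

# Hypothetical 2-D data lying close to the line y2 = 2 * y1.
data = [[-2.0, -4.1], [-1.0, -1.9], [0.0, 0.1], [1.0, 2.0], [2.0, 3.9]]
S, mean = covariance(data)
v = top_eigenvector(S)

# One-dimensional representation: projection onto the first principal component.
codes = [sum(v[j] * (row[j] - mean[j]) for j in range(2)) for row in data]
print(v)
print(codes)
```

The recovered direction points along the line the data was scattered around, and `codes` plays the role of the projection $x = \Lambda^\top y$ with $K = 1$.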



2.3 Independent Components Analysis (ICA) 

Independent components analysis (ICA) extends factor analysis to the case
where the factors are non-Gaussian. This is an interesting extension because
many real-world data sets have structure which can be modeled as linear
combinations of sparse sources. This includes auditory data, images, biological signals
such as EEG, etc. Sparsity simply corresponds to the assumption that the
factors have distributions with higher kurtosis than the Gaussian. For example,
$p(x) = \frac{\lambda}{2} \exp\{-\lambda|x|\}$ has a higher peak at zero and heavier tails than a
Gaussian with corresponding mean and variance, so it would be considered sparse
(strictly speaking, one would like a distribution which had non-zero probability
mass at 0 to get true sparsity).

Models like PCA, FA and ICA can all be implemented using neural networks
(multi-layer perceptrons) trained using various cost functions. It is not clear
what advantage this implementation/interpretation has from a machine
learning perspective, although it provides interesting ties to biological information
processing.

Rather than ML estimation, one can also do Bayesian inference for the
parameters of probabilistic PCA, FA, and ICA.

2.4 Mixture of Gaussians 

The densities modeled by PCA, FA and ICA are all relatively simple in that they
are unimodal and have fairly restricted parametric forms (Gaussian, in the case
of PCA and FA). To model data with more complex structure such as clusters,
it is very useful to consider mixture models. Although it is straightforward to
consider mixtures of arbitrary densities, we will focus on Gaussians as a common
special case. The density of each data point in a mixture model can be written:

$$p(y|\theta) = \sum_{k=1}^{K} \pi_k \, p(y|\theta_k) \tag{13}$$

where each of the $K$ components of the mixture is, for example, a Gaussian with
differing means and covariances $\theta_k = (\mu_k, \Sigma_k)$ and $\pi_k$ is the mixing proportion
for component $k$, such that $\sum_{k=1}^{K} \pi_k = 1$ and $\pi_k > 0, \forall k$.

A different way to think about mixture models is to consider them as latent 
variable models, where associated with each data point is a $K$-ary discrete latent
(i.e. hidden) variable $s$ which has the interpretation that $s = k$ if the data point
was generated by component $k$. This can be written

$$p(y|\theta) = \sum_{k=1}^{K} P(s = k|\pi) \, p(y|s = k, \theta) \tag{14}$$

where $P(s = k|\pi) = \pi_k$ is the prior for the latent variable taking on value
$k$, and $p(y|s = k, \theta) = p(y|\theta_k)$ is the density under component $k$, recovering
Equation (13).

2.5 K-Means

The mixture of Gaussians model is closely related to an unsupervised clustering
algorithm known as k-means as follows: Consider the special case where all the
Gaussians have common covariance matrix proportional to the identity matrix:
$\Sigma_k = \sigma^2 I, \forall k$, and let $\pi_k = 1/K, \forall k$. We can estimate the maximum likelihood
parameters of this model using the iterative algorithm which we are about to
describe, known as EM. The resulting algorithm, as we take the limit $\sigma^2 \to 0$,
becomes exactly the k-means algorithm. Clearly the model underlying k-means
has only singular Gaussians and is therefore an unreasonable model of the data;
however, k-means is usually justified from the point of view of clustering to
minimize a distortion measure, rather than fitting a probabilistic model.
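
For reference, the k-means procedure itself can be sketched as alternating hard assignments and mean updates. The version below is a minimal illustration on made-up one-dimensional points with hand-picked initial centers; the squared distance stands in for the distortion measure, and the function name is hypothetical.

```python
def kmeans(points, centers, iters=20):
    """Alternate hard assignments and mean updates on 1-D points."""
    centers = list(centers)
    for _ in range(iters):
        # Assignment step: each point goes to its nearest center.
        clusters = [[] for _ in centers]
        for p in points:
            k = min(range(len(centers)), key=lambda j: (p - centers[j]) ** 2)
            clusters[k].append(p)
        # Update step: each center moves to the mean of its points
        # (an empty cluster keeps its previous center).
        centers = [sum(c) / len(c) if c else centers[k]
                   for k, c in enumerate(clusters)]
    return centers, clusters

# Hypothetical data with two obvious groups.
points = [1.0, 1.2, 0.8, 5.0, 5.2, 4.8]
centers, clusters = kmeans(points, centers=[0.0, 6.0])
print(centers)
print(clusters)
```

The hard assignment is what remains of the posterior over the latent variable $s$ once the component variances shrink to zero: all the probability mass moves to the nearest component.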



3 The EM Algorithm 

The EM algorithm is an algorithm for estimating ML parameters of a model 
with latent variables. Consider a model with observed variables y, hidden/latent 
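variables $x$, and parameters $\theta$. We can lower bound the log likelihood for any
data point as follows

$$L(\theta) = \log p(y|\theta) = \log \int p(x, y|\theta) \, dx \tag{15}$$

$$= \log \int q(x) \frac{p(x, y|\theta)}{q(x)} \, dx \tag{16}$$

$$\geq \int q(x) \log \frac{p(x, y|\theta)}{q(x)} \, dx \stackrel{\mathrm{def}}{=} F(q, \theta) \tag{17}$$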



where $q(x)$ is some arbitrary density over the hidden variables, and the lower
bound holds due to the concavity of the log function (this inequality is known
as Jensen’s inequality). The lower bound $F$ is a functional of both the density
$q(x)$ and the model parameters $\theta$. For a data set of $N$ data points $y^{(1)}, \ldots, y^{(N)}$,
this lower bound is formed for the log likelihood term corresponding to each
data point, thus there is a separate density $q^{(n)}(x)$ for each point and
$F(q, \theta) = \sum_n F^{(n)}(q^{(n)}, \theta)$.

The basic idea of the Expectation-Maximization (EM) algorithm is to iterate 
between optimizing this lower bound as a function of $q$ and as a function of $\theta$. We
can prove that this will never decrease the log likelihood. After initializing the
parameters somehow, the $k$-th iteration of the algorithm consists of the following
two steps:



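E Step: optimize $F$ with respect to $q$, holding the parameters fixed:

$$q_k(x) = \arg\max_{q(x)} \int q(x) \log \frac{p(x, y|\theta_{k-1})}{q(x)} \, dx \tag{18}$$

$$q_k(x) = p(x|y, \theta_{k-1}) \tag{19}$$

M Step: optimize $F$ with respect to $\theta$, holding $q$ fixed:

$$\theta_k = \arg\max_{\theta} \int q_k(x) \log \frac{p(x, y|\theta)}{q_k(x)} \, dx \tag{20}$$

$$\theta_k = \arg\max_{\theta} \int q_k(x) \log p(x, y|\theta) \, dx \tag{21}$$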


Let us be absolutely clear what happens for a data set of $N$ data points:
In the E step, for each data point, the distribution over the hidden variables is
set to the posterior for that data point, $q_k^{(n)}(x) = p(x|y^{(n)}, \theta_{k-1}), \forall n$. In the M
step the single set of parameters is re-estimated by maximizing the sum of the
expected log likelihoods: $\theta_k = \arg\max_{\theta} \sum_n \int q_k^{(n)}(x) \log p(x, y^{(n)}|\theta) \, dx$.

Two things are still unclear: how does (19) follow from (18), and how is this 
algorithm guaranteed to increase the likelihood? The optimization in (18) can 
be written as follows since $p(x, y|\theta_{k-1}) = p(y|\theta_{k-1}) \, p(x|y, \theta_{k-1})$:

$$q_k(x) = \arg\max_{q(x)} \left[ \log p(y|\theta_{k-1}) + \int q(x) \log \frac{p(x|y, \theta_{k-1})}{q(x)} \, dx \right] \tag{22}$$

Now, the first term is a constant w.r.t. $q(x)$ and the second term is the
negative of the Kullback-Leibler divergence

$$\mathrm{KL}(q(x) \| p(x|y, \theta_{k-1})) = \int q(x) \log \frac{q(x)}{p(x|y, \theta_{k-1})} \, dx \tag{23}$$

which we have seen in Equation (3) in its discrete form. This is minimized at
$q(x) = p(x|y, \theta_{k-1})$, where the KL divergence is zero. Intuitively, the
interpretation of this is that in the E step of EM, the goal is to find the posterior
distribution of the hidden variables given the observed variables and the current
settings of the parameters. We also see that since the KL divergence is zero, at
the end of the E step, $F(q_k, \theta_{k-1}) = L(\theta_{k-1})$.

In the M step, $F$ is increased with respect to $\theta$. Therefore,
$F(q_k, \theta_k) \geq F(q_k, \theta_{k-1})$. Moreover, $L(\theta_k) = F(q_{k+1}, \theta_k) \geq F(q_k, \theta_k)$
after the next E step. We can put these steps together to establish that
$L(\theta_k) \geq L(\theta_{k-1})$, establishing
that the algorithm is guaranteed to increase the likelihood or keep it fixed (at
convergence).

The EM algorithm can be applied to all the latent variable models described 
above, i.e. FA, probabilistic PCA, mixture models, and ICA. In the case of mix- 
ture models, the hidden variable is the discrete assignment s of data points to 
clusters; consequently the integrals turn into sums where appropriate. EM has 
wide applicability to latent variable models, although it is not always the fastest 
optimization method [8]. Moreover, we should note that the likelihood often has 
many local optima and EM will converge to some local optimum which may not
be the global one. 
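
As an illustration of the E and M steps for a mixture model, the sketch below runs EM on a one-dimensional mixture of two Gaussians and records the log likelihood after every iteration. The data, the initial parameters, and the small variance floor (added only to guard against a collapsing component) are assumptions made for the example.

```python
import math

def gauss(y, mu, var):
    """Univariate Gaussian density."""
    return math.exp(-0.5 * (y - mu) ** 2 / var) / math.sqrt(2.0 * math.pi * var)

def log_likelihood(data, pi, mu, var):
    """Sum over data points of log p(y | theta) for the mixture density."""
    return sum(math.log(sum(pi[k] * gauss(y, mu[k], var[k])
                            for k in range(len(pi)))) for y in data)

def em_step(data, pi, mu, var):
    K = len(pi)
    # E step: q(s = k) for each point is the posterior P(s = k | y, theta).
    resp = []
    for y in data:
        joint = [pi[k] * gauss(y, mu[k], var[k]) for k in range(K)]
        total = sum(joint)
        resp.append([j / total for j in joint])
    # M step: maximize the expected complete-data log likelihood.
    Nk = [sum(r[k] for r in resp) for k in range(K)]
    pi = [Nk[k] / len(data) for k in range(K)]
    mu = [sum(r[k] * y for r, y in zip(resp, data)) / Nk[k] for k in range(K)]
    var = [max(sum(r[k] * (y - mu[k]) ** 2 for r, y in zip(resp, data)) / Nk[k],
               1e-6)  # assumed floor, not part of the basic algorithm
           for k in range(K)]
    return pi, mu, var

# Hypothetical data drawn around -2 and +2.
data = [-2.1, -1.9, -2.0, -1.8, -2.2, 1.9, 2.1, 2.0, 2.3, 1.7]
pi, mu, var = [0.5, 0.5], [-1.0, 1.0], [1.0, 1.0]
lls = [log_likelihood(data, pi, mu, var)]
for _ in range(25):
    pi, mu, var = em_step(data, pi, mu, var)
    lls.append(log_likelihood(data, pi, mu, var))
print(mu, var, pi)
```

The recorded values in `lls` never decrease, which is the guarantee derived above; a different initialization could still end in a different local optimum.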

EM can also be used to estimate MAP parameters of a model, and as we will 
see in Section 11.4 there is a Bayesian generalization of EM as well. 
4 Modeling Time Series and Other Structured Data

So far we have assumed that the data is unstructured, that is, the observations
are assumed to be independent and identically distributed. This assumption is
unreasonable for many data sets in which the observations arrive in a sequence
and subsequent observations are correlated. Sequential data can occur in time
series modeling (as in financial data or the weather) and also in situations where
the sequential nature of the data is not necessarily tied to time (as in protein
data which consist of sequences of amino acids).

At the most basic level, time series modeling consists of building a
probabilistic model of the present observation given all past observations
$p(y_t | y_{t-1}, y_{t-2}, \ldots)$. Because the history of observations grows arbitrarily large
it is necessary to limit the complexity of such a model. There are essentially two 
ways of doing this. 

The first approach is to limit the window of past observations. Thus one can 
simply model $p(y_t | y_{t-1})$ and assume that this relation holds for all $t$. This is
known as a first-order Markov model. A second-order Markov model would be
$p(y_t | y_{t-1}, y_{t-2})$, and so on. Such Markov models have two limitations: First,
the influence of past observations on present observations vanishes outside this 
window, which can be unrealistic. Second, it may be unnatural and unwieldy to 
model directly the relationship between raw observations at one time step and 
raw observations at a subsequent time step. For example, if the observations 
are noisy images, it would make more sense to de-noise them, extract some 
description of the objects, motions, illuminations, and then try to predict from 
that. 
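
For discrete observations, the first-order Markov model described above can be fit by counting transitions, which gives the maximum likelihood estimate of $p(y_t | y_{t-1})$. The sketch below uses a made-up symbol sequence; the function name is hypothetical.

```python
from collections import Counter

def fit_first_order_markov(seq):
    """ML estimate of p(y_t | y_{t-1}) from transition counts."""
    pair_counts = Counter(zip(seq[:-1], seq[1:]))
    prev_counts = Counter(seq[:-1])
    return {(prev, cur): c / prev_counts[prev]
            for (prev, cur), c in pair_counts.items()}

# Hypothetical sequence over the alphabet {A, B}.
seq = "AABABBBAAB"
trans = fit_first_order_markov(seq)
print(trans[("A", "B")])  # probability of B following A
```

A second-order model would condition on pairs of previous symbols in the same way, at the cost of a table that grows exponentially with the window length.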

The second approach is to make use of latent or hidden variables. Instead of 
modeling directly the effect of $y_{t-1}$ on $y_t$, we assume that the observations were
generated from some underlying hidden variable $x_t$ which captures the dynamics
of the system. For example, $y$ might be noisy sonar readings of objects in a room,
while $x$ might be the actual locations and sizes of these objects. We usually call
this hidden variable $x$ the state variable since it is meant to capture all the
aspects of the system relevant to predicting the future dynamical behavior of
the system. 

In order to understand more complex time series models, it is essential that 
one be familiar with state-space models (SSMs) and hidden Markov models 
(HMMs). These two classes of models have played a historically important role 
in control engineering, visual tracking, speech recognition, protein sequence mod- 
eling, and error decoding. They form the simplest building blocks from which 
other richer time-series models can be developed, in a manner completely anal- 
ogous to the role that FA and mixture models play in building more complex 
models for i.i.d. data. 



4.1 State-Space Models (SSMs) 

In a state-space model, the sequence of observed data $y_1, y_2, y_3, \ldots$ is assumed to
have been generated from some sequence of hidden state variables $x_1, x_2, x_3, \ldots$.
Letting $x_{1:t}$ denote the sequence $x_1, \ldots, x_t$, the basic assumption in an SSM
is that the joint probability of the hidden states and observations factors in the
following way:

$$p(x_{1:T}, y_{1:T}|\theta) = \prod_{t=1}^{T} p(x_t|x_{t-1}, \theta) \, p(y_t|x_t, \theta) \tag{24}$$

In other words, the observations are assumed to have been generated from
the hidden states via $p(y_t|x_t, \theta)$, and the hidden states are assumed to have
first-order Markov dynamics captured by $p(x_t|x_{t-1}, \theta)$. We can consider the first
term $p(x_1|x_0, \theta)$ to be a prior on the initial state of the system $x_1$.

The simplest kind of state-space model assumes that all variables are
multivariate Gaussian distributed and all the relationships are linear. In such
linear-Gaussian state-space models, we can write

$$y_t = C x_t + v_t \tag{25}$$

$$x_t = A x_{t-1} + w_t \tag{26}$$

where the matrices $C$ and $A$ define the linear relationships and $v$ and $w$ are
zero-mean Gaussian noise vectors with covariance matrices $R$ and $Q$ respectively. If
we assume that the prior on the initial state $p(x_1)$ is also Gaussian, then all
subsequent $x$s and $y$s are also Gaussian due to the fact that Gaussian densities
are closed under linear transformations. This model can be generalized in many
ways, for example by augmenting it to include a sequence of observed inputs
$u_1, \ldots, u_T$ as well as the observed model outputs $y_1, \ldots, y_T$, but we will not
discuss generalizations further.

By comparing equations (11) and (25) we see that linear-Gaussian SSMs can 
be thought of as a time-series generalization of factor analysis where the factors 
are assumed to have linear-Gaussian dynamics over time. 

The parameters of this model are $\theta = (A, C, Q, R)$. To learn ML settings of
these parameters one can make use of the EM algorithm [9]. The E step of the
algorithm involves computing $q(x_{1:T}) = p(x_{1:T}|y_{1:T}, \theta)$ which is the posterior
over hidden state sequences. In fact, this whole posterior does not have to be
computed or represented, all that is required are the marginals $q(x_t)$ and
pairwise marginals $q(x_t, x_{t+1})$. These can be computed via the Kalman smoothing
algorithm, which is an efficient algorithm for inferring the distribution over the
hidden states of a linear-Gaussian SSM. Since the model is linear, the M step of
the algorithm requires solving a pair of weighted linear regression problems to 
re-estimate A and C, while Q and R are estimated from the residuals of those 
regressions. This is analogous to the M step of factor analysis, which also involves 
solving a linear regression problem. 
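
To give a feel for inference in a linear-Gaussian SSM, the sketch below implements only the forward (filtering) pass for the scalar case of Equations (25) and (26), computing $p(x_t | y_{1:t})$. The smoothed marginals needed in the E step additionally require a backward pass, which is omitted here. All parameter values and the prior (`mu0`, `V0`) are made-up assumptions.

```python
def kalman_filter(ys, A, C, Q, R, mu0, V0):
    """Forward pass for the scalar model x_t = A x_{t-1} + w_t, y_t = C x_t + v_t.

    Returns the mean and variance of p(x_t | y_1..y_t) for every t.
    mu0 and V0 parameterize the Gaussian prior on the first state x_1.
    """
    means, variances = [], []
    mu_pred, V_pred = mu0, V0
    for y in ys:
        # Measurement update: condition the predicted Gaussian on y_t.
        gain = V_pred * C / (C * V_pred * C + R)
        mu = mu_pred + gain * (y - C * mu_pred)
        V = (1.0 - gain * C) * V_pred
        means.append(mu)
        variances.append(V)
        # Time update: push the filtered Gaussian through the dynamics.
        mu_pred = A * mu
        V_pred = A * V * A + Q
    return means, variances

# Hypothetical setting: a nearly constant state observed with unit noise.
ys = [1.0] * 10
means, variances = kalman_filter(ys, A=1.0, C=1.0, Q=0.01, R=1.0, mu0=0.0, V0=1.0)
print(means)
print(variances)
```

Because every distribution involved stays Gaussian, each step only has to update a mean and a variance, which is what makes inference in this model efficient.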

4.2 Hidden Markov Models (HMMs) 

Hidden Markov models are similar to state-space models in that the sequence of 
observations is assumed to have been generated from a sequence of underlying 
hidden states. The key difference is that in HMMs the state is assumed to be
discrete rather than a continuous random vector. Let $s_t$ denote the hidden state
of an HMM at time $t$. We assume that $s_t$ can take discrete values in $\{1, \ldots, K\}$.
The model can again be written as in (24):

$$P(s_{1:T}, y_{1:T}|\theta) = \prod_{t=1}^{T} P(s_t|s_{t-1}, \theta) \, P(y_t|s_t, \theta) \tag{27}$$

where $P(s_1|s_0, \theta)$ is simply some initial distribution over the $K$ settings of the
first hidden state; we can call this discrete distribution $\pi$, represented by a $K \times 1$
vector. The state-transition probabilities $P(s_t|s_{t-1}, \theta)$ are captured by a $K \times K$
transition matrix $A$, with elements $A_{ij} = P(s_t = i|s_{t-1} = j, \theta)$. The observations
in an HMM can be either continuous or discrete. For continuous observations
$y_t$ one can for example choose a Gaussian density; thus $p(y_t|s_t = i, \theta)$ would
be a different Gaussian for each choice of $i \in \{1, \ldots, K\}$. This model is the
dynamical generalization of a mixture of Gaussians. The marginal probability
at each point in time is exactly a mixture of $K$ Gaussians; the difference is
that which component generates data point $y_t$ and which component generated
$y_{t-1}$ are not independent random variables, but certain combinations are more
and less probable depending on the entries in $A$. For $y_t$ a discrete observation,
let us assume that it can take on values $\{1, \ldots, L\}$. In that case the output
probabilities $P(y_t|s_t, \theta)$ can be captured by an $L \times K$ emission matrix, $E$.

The model parameters for a discrete-observation HMM are $\theta = (\pi, A, E)$.
Maximum likelihood learning of the model parameters can be approached
using the EM algorithm, which in the case of HMMs is known as the Baum-Welch
algorithm. The E step involves computing $Q(s_t)$ and $Q(s_t, s_{t+1})$ which
are marginals of $Q(s_{1:T}) = P(s_{1:T}|y_{1:T}, \theta)$. These marginals are computed as
part of the forward-backward algorithm which as the name suggests sweeps
forward and backward through the time series, and applies Bayes rule efficiently
using the Markov conditional independence properties of the HMM, to compute
the required marginals. The M step of HMM learning involves re-estimating $\pi$,
$A$, and $E$ by adding up and normalizing expected counts for transitions and
emissions that were computed in the E step.
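
The forward half of this computation can be sketched for a discrete-observation HMM as follows. The sketch follows the indexing convention of the text, $A_{ij} = P(s_t = i | s_{t-1} = j)$, but numbers states and symbols from 0 rather than 1; the parameter values are made up, and the backward sweep needed for the full E step is omitted.

```python
def forward(obs, pi, A, E):
    """Return P(y_1..y_T) and the filtered state posteriors for a discrete HMM.

    pi[i]   = P(s_1 = i)
    A[i][j] = P(s_t = i | s_{t-1} = j)   (columns sum to one)
    E[l][i] = P(y_t = l | s_t = i)       (columns sum to one)
    """
    K = len(pi)
    likelihood = 1.0
    filtered = []
    alpha = [pi[i] * E[obs[0]][i] for i in range(K)]
    for t in range(len(obs)):
        if t > 0:
            # Propagate the previous posterior through A, then weight by the emission.
            alpha = [E[obs[t]][i] * sum(A[i][j] * prev[j] for j in range(K))
                     for i in range(K)]
        scale = sum(alpha)             # P(y_t | y_1..y_{t-1})
        likelihood *= scale
        prev = [a / scale for a in alpha]
        filtered.append(prev)          # P(s_t | y_1..y_t)
    return likelihood, filtered

# Hypothetical two-state, two-symbol HMM.
pi = [0.6, 0.4]
A = [[0.7, 0.2],
     [0.3, 0.8]]
E = [[0.9, 0.3],
     [0.1, 0.7]]
lik, filtered = forward([0, 1, 0], pi, A, E)
print(lik)
print(filtered)
```

Normalizing at each step keeps the recursion numerically stable on long sequences, and the product of the normalizers is the likelihood of the observed sequence.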

4.3 Modeling Other Structured Data 

We have considered the case of i.i.d. data and time series data. The observations 
in real world data sets can have many other possible structures as well. Let us 
mention a few examples, although it is not possible to strive for completeness. 

In spatial data, the points are assumed to live in some metric, often Euclidean, 
space. Three examples of spatial data include epidemiological data which can 
be modeled as a function of the spatial location of the measurement; data from 
computer vision where the observations are measurements of features on a 2D 
input to the camera; and functional neuroimaging where the data can be phys- 
iological measurements related to neural activity located in 3D voxels defining 
coordinates in the brain. Generalizing HMMs, one can define Markov random 
field models where there is a set of hidden variables correlated to neighbors in
some lattice, and related to the observed variables. 

Hierarchical or tree-structured data contains known or unknown tree-like 
correlation structure between the data points or measured features. For example, 
the data points may be features of animals related through an evolutionary 
tree. A very different form of structured data is if each data point itself is tree- 
structured, for example if each point is a parse tree of a sentence in the English 
language. 

Finally, one can take the structured dependencies between variables and
consider the structure itself as an unknown part of the model. Such models are
known as probabilistic relational models and are closely related to graphical
models which we will discuss in Section 7.



5 Nonlinear, Factorial, and Hierarchical Models 

The models we have described so far are attractive because they are relatively 
simple to understand and learn. However, their simplicity is also a limitation, 
since the intricacies of real-world data are unlikely to be well-captured by a 
simple statistical model. This motivates us to seek to describe and study learning 
in much more flexible models. 

A simple combination of two of the ideas we have described for i.i.d. data
is the mixture of factor analysers [10, 11, 12]. This model performs simultaneous
clustering and dimensionality reduction on the data, by assuming that the
covariance in each Gaussian cluster can be modeled by an FA model. Thus, it 
becomes possible to apply a mixture model to very high dimensional data while 
allowing each cluster to span a different sub-space of the data. 

As their name implies, linear-Gaussian SSMs are limited by assumptions of
linearity and Gaussian noise. In many realistic dynamical systems there are
significant nonlinear effects, which make it necessary to consider learning in
nonlinear state-space models. Such models can also be learned using the EM
algorithm, but the E step must deal with inference in non-Gaussian and poten- 
tially very complicated densities (since non-linearities will turn Gaussians into 
non-Gaussians), and the M step is nonlinear regression, rather than linear regres- 
sion [13]. There are many methods of dealing with inference in non-linear SSMs, 
including methods such as particle filtering [14, 15, 16, 17, 18, 19], linearization 
[20], the unscented filter [21, 22], the EP algorithm [23], and embedded HMMs 
[24]. 

Non-linear models are also important if we are to consider generalizing simple
dimensionality reduction models such as PCA and FA. These models are
limited in that they can only find a linear subspace of the data to capture the 
correlations between the observed variables. There are many interesting and 
important nonlinear dimensionality reduction models, including generative to- 
pographic mappings (GTM) [25] (a probabilistic alternative to Kohonen maps), 
multi-dimensional scaling (MDS) [26, 27], principal curves [28], Isomap [29], and 
locally linear embedding (LLE) [30].

Hidden Markov models also have their limitations. Even though they can
model nonlinear dynamics by discretizing the hidden state space, an HMM with
K hidden states can only capture $\log_2 K$ bits of information in its state variable
about the past of the sequence. HMMs can be extended by allowing a vector
of discrete state variables, in an architecture known as a factorial HMM [31].
Thus a vector of M variables, each of which can take K states, can capture
$K^M$ possible states in total, and $M \log_2 K$ bits of information about the past of
the sequence. The problem is that such a model, if dealt with naively as an HMM,
would have exponentially many parameters and would take exponentially long
to do inference in. Both the complexity in time and number of parameters can
be alleviated by restricting the interactions between the hidden variables at one
time step and at the next time step. A generalization of these ideas is the notion
of a dynamical Bayesian network (DBN) [32].

A relatively old but still quite powerful class of models for binary data is
the Boltzmann machine (BM) [33]. This is a simple model inspired by Ising
models in statistical physics. A BM is a multivariate model for capturing
correlations and higher order statistics in vectors of binary data. Consider data
consisting of vectors of M binary variables (the elements of the vector may, for
example, be pixels in a black-and-white image). Clearly, each data point can be
an instance of one of $2^M$ possible patterns. An arbitrary distribution over such
patterns would require a table with $2^M - 1$ entries, again intractable in number
of parameters, storage, and computation time. A BM allows one to define
flexible distributions over the $2^M$ entries of this table by using $O(M^2)$ parameters
defining a symmetric matrix of weights connecting the variables. This can
be augmented with hidden variables in order to enrich the model class, without 
adding exponentially many parameters. These hidden variables can be organized 
into layers of a hierarchy as in the Helmholtz machine [34]. Other hierarchical 
models include recent generalizations of ICA designed to capture higher order 
statistics in images [35]. 



6 Intractability 

The problem with the models described in the previous section is that learn- 
ing their parameters is in general computationally intractable. In a model with 
exponentially many settings for the hidden states, doing the E step of an EM 
algorithm would require computing appropriate marginals of a distribution over 
exponentially many possibilities. 

Let us consider a simple example. Imagine we have a vector of N binary
random variables $s = (s_1, \ldots, s_N)$, where $s_i \in \{0, 1\}$, and a vector of N known
integers $(r_1, \ldots, r_N)$, where $r_i \in \{1, 2, 3, \ldots, 10\}$. Let the variable
$Y = \sum_{i=1}^{N} r_i s_i$. Assume that the binary variables are all independent and
identically distributed with $P(s_i = 1) = 1/2$ for all i. Let N be 100. Now imagine
that we are told Y = 430. How do we compute $P(s_i = 1 \mid Y = 430)$? The problem is
that even though the $s_i$ were independent before we observed the value of Y, now
that we know the value of Y, not all settings of s are possible anymore. To figure
out for some $s_i$ the probability $P(s_i = 1 \mid Y = 430)$ requires that we enumerate
all potentially exponentially many ways of achieving Y = 430 and count how many
of those had $s_i = 1$ vs $s_i = 0$.
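
The enumeration described above can be sketched for a toy instance. The following is only an illustration under stated assumptions: the function name `posterior_by_enumeration`, the four integers in `r` and the target value 10 are hypothetical choices made so that all $2^N$ settings can be listed by hand; with N = 100 this brute-force loop would be hopeless.

```python
from itertools import product

def posterior_by_enumeration(r, y):
    """Return [P(s_i = 1 | Y = y)] by enumerating all 2^N settings of s.

    The prior is uniform over settings (each s_i is 1 with probability 1/2),
    so the posterior is obtained by counting the settings consistent with Y = y.
    """
    n = len(r)
    total = 0            # number of settings with sum_i r_i s_i == y
    ones = [0] * n       # how many of those settings have s_i == 1
    for s in product((0, 1), repeat=n):
        if sum(ri * si for ri, si in zip(r, s)) == y:
            total += 1
            for i, si in enumerate(s):
                ones[i] += si
    if total == 0:
        raise ValueError("no setting of s achieves this value of Y")
    return [count / total for count in ones]

# Hypothetical toy problem: N = 4 known integers and an observed sum Y = 10.
r = [3, 5, 2, 7]
post = posterior_by_enumeration(r, 10)
```

Only two settings reach Y = 10 here (3 + 7 and 3 + 5 + 2), so the posterior for the first variable is 1 while the others are 1/2, and the second and fourth variables can never both be on: variables that were independent a priori have become dependent.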

This example illustrates the following ideas: Even if the prior is simple, the
posterior can be very complicated. Whether two random variables are independent
or not is a function of one's state of knowledge. Thus $s_i$ and $s_j$ may be
independent if we are not told the value of Y but are certainly dependent given
the value of Y. This type of phenomenon is related to "explaining-away", which
refers to the fact that if there are multiple potential causes for some effect,
observing one explains away the need for the others [36].

Intractability can thus occur if we have a model with discrete hidden vari- 
ables which can take on exponentially many combinations. Intractability can 
also occur with continuous hidden variables if their density is not simply de- 
scribed, or if they interact with discrete hidden variables. Moreover, even for 
simple models, such as a mixture of Gaussians, intractability occurs when we 
consider the parameters to be unknown as well, and we attempt to do Bayesian 
inference on them. To deal with intractability it is essential to have good tools 
for representing multivariate distributions, such as graphical models. 



7 Graphical Models 

Graphical models are an important tool for representing the dependencies be- 
tween random variables in a probabilistic model. They are important for two 
reasons. First, graphs are an intuitive way of visualizing dependencies. We are 
used to graphical depictions of dependency, for example in circuit diagrams and 
in phylogenetic trees. Second, by exploiting the structure of the graph it is pos- 
sible to devise efficient message passing algorithms for computing marginal and 
conditional probabilities in a complicated model. We discuss message passing 
algorithms for inference in Section 8. 

The main statistical property represented explicitly by the graph is
conditional independence between variables. We say that X and Y are conditionally
independent given Z, if $P(X, Y \mid Z) = P(X \mid Z) P(Y \mid Z)$ for all values of the
variables X, Y, and Z where these quantities are defined (i.e. excepting settings
z where P(Z = z) = 0). We use the notation $X \perp\!\!\!\perp Y \mid Z$ to denote the above
conditional independence relation. Conditional independence generalizes to sets
of variables in the obvious way, and it is different from marginal independence,
which states that $P(X, Y) = P(X) P(Y)$, and is denoted $X \perp\!\!\!\perp Y$.

There are several different graphical formalisms for depicting conditional in- 
dependence relationships. We focus on three of the main ones: undirected, factor, 
and directed graphs. 

7.1 Undirected Graphs 

In an undirected graphical model each random variable is represented by a node,
and the edges of the graph indicate conditional independence relationships.
Specifically, let $\mathcal{X}$, $\mathcal{Y}$, and $\mathcal{Z}$ be sets of random variables. Then
$\mathcal{X} \perp\!\!\!\perp \mathcal{Y} \mid \mathcal{Z}$ if every path on the graph from a node in $\mathcal{X}$ to a node
in $\mathcal{Y}$ has to go through a node in $\mathcal{Z}$. Thus a variable X is conditionally
independent of all other variables given the neighbors of X, and we say that the
neighbors separate X from the rest of the graph. An example of an undirected
graph is shown in Figure 1. In this graph $A \perp\!\!\!\perp B \mid C$ and
$B \perp\!\!\!\perp E \mid \{C, D\}$, for example, and the neighbors of D are B, C, E.

Fig. 1. Three kinds of probabilistic graphical model: undirected graphs, factor graphs
and directed graphs

A clique is a fully connected subgraph of a graph. A maximal clique is not
contained in any other clique of the graph. It turns out that the set of
conditional independence relations implied by the separation properties in the
graph is satisfied by probability distributions which can be written as a normalized
product of non-negative functions over the variables in the maximal cliques of the
graph (this is known as the Hammersley-Clifford Theorem [37]). In the example
in Figure 1, this implies that the probability distribution over (A, B, C, D, E)
can be written as:



$$P(A, B, C, D, E) = c \, g_1(A, C) \, g_2(B, C, D) \, g_3(C, D, E) \qquad (28)$$

Here, c is the constant that ensures that the probability distribution sums to
1, and $g_1$, $g_2$ and $g_3$ are non-negative functions of their arguments. For example,
if all the variables are binary the function $g_2$ is a table with a non-negative
number for each of the 8 = 2 x 2 x 2 possible settings of the variables B, C, D.
These non-negative functions are supposed to represent how compatible these
settings are with each other, with a 0 encoding logical incompatibility. For this
reason, the g's are sometimes referred to as compatibility functions, other times
as potential functions. Undirected graphical models are also sometimes referred
to as Markov networks.
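
As a small numerical check of the factorization over maximal cliques, the sketch below builds a joint distribution over five binary variables from three potential tables on the cliques {A, C}, {B, C, D} and {C, D, E}, computes the normalising constant by brute force, and tests that A and B are conditionally independent given C. The potential values are random and purely hypothetical, as are the helper names `unnormalised`, `marginal` and `max_ci_violation`.

```python
from itertools import product
import random

random.seed(0)
VARS = "ABCDE"
CLIQUES = ["AC", "BCD", "CDE"]   # maximal cliques assumed for the undirected graph

# Hypothetical positive potentials: one table entry per joint setting of a clique.
potentials = [
    {vals: random.uniform(0.1, 1.0) for vals in product((0, 1), repeat=len(cl))}
    for cl in CLIQUES
]

def unnormalised(x):
    """Product of clique potentials for a full assignment x (dict name -> 0/1)."""
    p = 1.0
    for cl, g in zip(CLIQUES, potentials):
        p *= g[tuple(x[v] for v in cl)]
    return p

assignments = [dict(zip(VARS, vals)) for vals in product((0, 1), repeat=5)]
c = 1.0 / sum(unnormalised(x) for x in assignments)       # normalisation constant
joint = {tuple(x.values()): c * unnormalised(x) for x in assignments}

def marginal(names, values):
    """P(names = values), summing the joint over all other variables."""
    idx = [VARS.index(n) for n in names]
    return sum(p for x, p in joint.items()
               if all(x[i] == v for i, v in zip(idx, values)))

def max_ci_violation():
    """Largest |P(A,B|C) - P(A|C) P(B|C)| over all settings of A, B, C."""
    worst = 0.0
    for a, b, cc in product((0, 1), repeat=3):
        pc = marginal("C", (cc,))
        lhs = marginal("ABC", (a, b, cc)) / pc
        rhs = (marginal("AC", (a, cc)) / pc) * (marginal("BC", (b, cc)) / pc)
        worst = max(worst, abs(lhs - rhs))
    return worst
```

Calling `max_ci_violation()` should return a value at the level of floating-point rounding, whatever positive numbers are placed in the potential tables. Note that the brute-force sum used for `c` is exactly the computation that becomes intractable for large graphs.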

7.2 Factor Graphs 

In a factor graph, there are two kinds of nodes, variable nodes and factor nodes,
usually denoted as open circles and filled dots (Figure 1). Like an undirected
model, the factor graph represents a factorization of the joint probability
distribution: each factor is a non-negative function of the variables connected to
the corresponding factor node. Thus for the factor graph in Figure 1 we have:

$$P(A, B, C, D, E) = c \, g_1(A, C) \, g_2(B, C) \, g_3(B, D) \, g_4(C, D) \, g_5(C, E) \, g_6(D, E) \qquad (29)$$

Factor nodes are also sometimes called function nodes. Again, as in an
undirected graphical model, the variables in a set $\mathcal{X}$ are conditionally independent
of the variables in a set $\mathcal{Y}$ given $\mathcal{Z}$ if all paths from $\mathcal{X}$ to $\mathcal{Y}$ go through
variables in $\mathcal{Z}$. Note that the factor graph in Figure 1 has exactly the same
conditional independence relations as the undirected graph, even though the factors
in the former are contained in the factors in the latter. Factor graphs are particularly
elegant and simple when it comes to implementing message passing algorithms 
for inference (Section 8). 

7.3 Directed Graphs 

In directed graphical models, also known as probabilistic directed acyclic graphs
(DAGs), belief networks, and Bayesian networks, the nodes represent random
variables and the directed edges represent statistical dependencies. If there exists
an edge from A to B we say that A is a parent of B, and conversely B is a child
of A. A directed graph corresponds to the factorization of the joint probability
into a product of the conditional probabilities of each node given its parents. For
the example in Figure 1 we write:

$$P(A, B, C, D, E) = P(A) P(B) P(C \mid A, B) P(D \mid B, C) P(E \mid C, D) \qquad (30)$$

In general we would write:

$$P(X_1, \ldots, X_N) = \prod_{i=1}^{N} P(X_i \mid X_{\mathrm{pa}_i}) \qquad (31)$$

where $X_{\mathrm{pa}_i}$ denotes the variables that are parents of $X_i$ in the graph.

Assessing the conditional independence relations in a directed graph is slightly
less trivial than in undirected and factor graphs. Rather than simply looking
at separation between sets of variables, one has to consider the directions of
the edges. The graphical test for two sets of variables being conditionally
independent given a third is called d-separation [36]. D-separation takes into
account the following fact about v-structures of the graph, which consist of two
(or more) parents of a child, as in the $A \rightarrow C \leftarrow B$ subgraph in Figure 1.
In such a v-structure $A \perp\!\!\!\perp B$, but it is not true that $A \perp\!\!\!\perp B \mid C$. That is,
A and B are marginally independent, but conditionally dependent given C. This can
be easily checked by writing out $P(A, B, C) = P(A) P(B) P(C \mid A, B)$. Summing
out C leads to $P(A, B) = P(A) P(B)$. However, given the value of C,
$P(A, B \mid C) = P(A) P(B) P(C \mid A, B) / P(C)$ which does not factor into separate
functions of A and B. As a consequence of this property of v-structures, in a
directed graph a variable X is independent of all other variables given the
parents of X, the children of X, and the parents of the children of X. This is the
minimal set that d-separates X from the rest of the graph and is known as the
Markov boundary for X.
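
The v-structure argument can be checked numerically. The sketch below uses hypothetical conditional probability tables for binary A, B and C (C behaves like a noisy OR of its parents); the numbers and function names are illustrative assumptions only.

```python
from itertools import product

# Hypothetical tables for the v-structure A -> C <- B with binary variables.
p_a = {0: 0.7, 1: 0.3}                                          # P(A)
p_b = {0: 0.6, 1: 0.4}                                          # P(B)
p_c1 = {(0, 0): 0.05, (0, 1): 0.8, (1, 0): 0.8, (1, 1): 0.95}   # P(C = 1 | A, B)

def joint(a, b, c):
    """P(A, B, C) = P(A) P(B) P(C | A, B)."""
    pc = p_c1[(a, b)] if c == 1 else 1.0 - p_c1[(a, b)]
    return p_a[a] * p_b[b] * pc

def p_ab(a, b):
    """Marginal P(A, B), obtained by summing out C."""
    return sum(joint(a, b, c) for c in (0, 1))

def p_ab_given_c(a, b, c):
    """Conditional P(A, B | C)."""
    p_c = sum(joint(x, y, c) for x, y in product((0, 1), repeat=2))
    return joint(a, b, c) / p_c

def p_a_given_c(a, c):
    return sum(p_ab_given_c(a, b, c) for b in (0, 1))

def p_b_given_c(b, c):
    return sum(p_ab_given_c(a, b, c) for a in (0, 1))

# Marginal independence: P(A, B) equals P(A) P(B) for every setting.
marginal_gap = max(abs(p_ab(a, b) - p_a[a] * p_b[b])
                   for a, b in product((0, 1), repeat=2))

# Conditional dependence: P(A, B | C) differs from P(A | C) P(B | C).
conditional_gap = abs(p_ab_given_c(1, 1, 1)
                      - p_a_given_c(1, 1) * p_b_given_c(1, 1))
```

With these tables `marginal_gap` is zero up to rounding, while `conditional_gap` is roughly 0.12: once C = 1 is observed, learning that A = 1 makes B = 1 less probable, which is the explaining-away effect.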

It is possible, though not always appropriate, to interpret a directed graphical
model as a causal generative model of the data. The following procedure would
generate data from the probability distribution defined by a directed graph: draw
a random value from the marginal distribution of all variables which do not
have any parents (e.g. $a \sim P(A)$, $b \sim P(B)$), then sample from the conditional
distribution of the children of these variables (e.g. $c \sim P(C \mid A = a, B = b)$),
and continue this procedure until all variables are assigned values. In the model, 
P{C\A, B) can capture the causal relationship between the causes A and B and 
the effect C. Such causal interpretations are much less natural for undirected and 
factor graphs, since even generating a sample from such models cannot easily be 
done in a hierarchical manner starting from “parents” to “children” except in 
special cases. Moreover, the potential functions capture mutual compatibilities, 
rather than cause-effect relations. 
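
The generative procedure for directed graphs can be sketched as follows for the five-variable directed graph of Figure 1, assuming binary variables. The conditional probability tables are random placeholders, and the names `PARENTS`, `make_random_cpts` and `ancestral_sample` are hypothetical.

```python
import random

# Parents of each node in the directed graph of Figure 1, listed so that
# every node appears after its parents (a topological order).
PARENTS = {"A": (), "B": (), "C": ("A", "B"), "D": ("B", "C"), "E": ("C", "D")}

def make_random_cpts(rng):
    """Hypothetical tables: P(X = 1 | parent values) for binary variables."""
    cpts = {}
    for node, pa in PARENTS.items():
        cpts[node] = {}
        for k in range(2 ** len(pa)):
            setting = tuple((k >> j) & 1 for j in range(len(pa)))
            cpts[node][setting] = rng.uniform(0.1, 0.9)
    return cpts

def ancestral_sample(cpts, rng):
    """Sample parentless nodes first, then each child given its sampled parents."""
    x = {}
    for node, pa in PARENTS.items():      # dicts preserve insertion order
        p1 = cpts[node][tuple(x[p] for p in pa)]
        x[node] = 1 if rng.random() < p1 else 0
    return x

rng = random.Random(0)
cpts = make_random_cpts(rng)
samples = [ancestral_sample(cpts, rng) for _ in range(20000)]
freq_a = sum(s["A"] for s in samples) / len(samples)   # should approach P(A = 1)
```

No normalising constant is needed at any point, because every draw uses an already normalised conditional distribution.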

A useful property of directed graphical models is that there is no global
normalization constant c. This global constant can be computationally intractable to
compute in undirected and factor graphs. In directed graphs, each term is a
conditional probability and is therefore already normalized:
$\sum_x P(X_i = x \mid X_{\mathrm{pa}_i}) = 1$.

7.4 Expressive Power 

Directed, undirected and factor graphs are complementary in their ability to
express conditional independence relationships. Consider the directed graph
consisting of a single v-structure $A \rightarrow C \leftarrow B$. This graph encodes
$A \perp\!\!\!\perp B$ but not $A \perp\!\!\!\perp B \mid C$. There exists no undirected graph or factor
graph over these three variables which captures exactly these independencies. For
example, in $A - C - B$ it is not true that $A \perp\!\!\!\perp B$ but it is true that
$A \perp\!\!\!\perp B \mid C$. Conversely, if we consider
the undirected graph in Figure 2, we see that some independence relationships
are better captured by undirected models (and factor graphs).




Fig. 2. No directed graph over 4 variables can represent the set of conditional inde- 
pendence relationships represented by this undirected graph 

8 Exact Inference in Graphs 

Probabilistic inference in a graph usually refers to the problem of computing the
conditional probability of some variable $X_i$ given the observed values of some
other variables $\mathcal{X}_{\mathrm{obs}} = x_{\mathrm{obs}}$ while marginalizing out all other variables.
Starting from a joint distribution $P(X_1, \ldots, X_N)$, we can divide the set of all
variables into three exhaustive and mutually exclusive sets
$\{X_1, \ldots, X_N\} = \{X_i\} \cup \mathcal{X}_{\mathrm{obs}} \cup \mathcal{X}_{\mathrm{other}}$. We wish to compute

$$P(X_i \mid \mathcal{X}_{\mathrm{obs}} = x_{\mathrm{obs}}) = \frac{\sum_{x} P(X_i, \mathcal{X}_{\mathrm{other}} = x, \mathcal{X}_{\mathrm{obs}} = x_{\mathrm{obs}})}{\sum_{x'} \sum_{x} P(X_i = x', \mathcal{X}_{\mathrm{other}} = x, \mathcal{X}_{\mathrm{obs}} = x_{\mathrm{obs}})} \qquad (32)$$



The problem is that the sum over x is exponential in the number of variables
in $\mathcal{X}_{\mathrm{other}}$. For example, if there are M variables in $\mathcal{X}_{\mathrm{other}}$ and each is
binary, then there are $2^M$ possible values for x. If the variables are continuous,
then the desired conditional probability is the ratio of two high-dimensional integrals,
which could be intractable to compute. Probabilistic inference is essentially a
problem of computing large sums and integrals.

There are several algorithms for computing these sums and integrals which 
exploit the structure of the graph to get the solution efficiently for certain graph 
structures (namely trees and related graphs). For general graphs the problem is 
fundamentally hard [38]. 



8.1 Elimination 

The simplest algorithm conceptually is variable elimination. It is easiest to
explain with an example. Consider computing $P(A = a \mid D = d)$ in the directed
graph in Figure 1. This can be written

$$
\begin{aligned}
P(A = a \mid D = d) &\propto \sum_c \sum_b \sum_e P(A = a, B = b, C = c, D = d, E = e) \\
&= \sum_c \sum_b \sum_e P(A = a) P(B = b) P(C = c \mid A = a, B = b) \\
&\qquad\qquad P(D = d \mid C = c, B = b) P(E = e \mid C = c, D = d) \\
&= \sum_c \sum_b P(A = a) P(B = b) P(C = c \mid A = a, B = b) \\
&\qquad\qquad P(D = d \mid C = c, B = b) \sum_e P(E = e \mid C = c, D = d) \\
&= \sum_c \sum_b P(A = a) P(B = b) P(C = c \mid A = a, B = b) P(D = d \mid C = c, B = b)
\end{aligned}
$$

What we did was (1) exploit the factorization, (2) rearrange the sums, and
(3) eliminate a variable, E. We could repeat this procedure and eliminate the
variable C. When we do this we will need to compute a new function
$\phi(A = a, B = b, D = d) \equiv \sum_c P(C = c \mid A = a, B = b) P(D = d \mid C = c, B = b)$,
resulting in:

$$P(A = a \mid D = d) \propto \sum_b P(A = a) P(B = b) \phi(A = a, B = b, D = d)$$

Finally, we eliminate B by computing
$\phi'(A = a, D = d) \equiv \sum_b P(B = b) \phi(A = a, B = b, D = d)$
to get our final answer, $P(A = a \mid D = d) \propto P(A = a) \phi'(A = a, D = d)$, which
after normalisation can be written

$$P(A = a \mid D = d) = \frac{P(A = a) \phi'(A = a, D = d)}{\sum_{a'} P(A = a') \phi'(A = a', D = d)}$$

The functions we get when we eliminate variables can be thought of as
messages sent by that variable to its neighbors. Eliminating a variable transforms
the graph by removing the eliminated node and drawing (undirected) edges between
all the nodes in the Markov boundary of the eliminated node.

The same answer is obtained no matter what order we eliminate variables in; 
however, the computational complexity can depend dramatically on the ordering 
used. 
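
The elimination steps in this example can be checked against brute-force summation. The sketch below assumes binary variables and random, purely hypothetical conditional probability tables for the directed graph of Figure 1; the names `random_cpt`, `posterior_brute_force` and `posterior_elimination` are illustrative.

```python
from itertools import product
import random

rng = random.Random(1)

def random_cpt(n_parents):
    """Hypothetical table: parent setting -> (P(X=0 | pa), P(X=1 | pa))."""
    table = {}
    for pa in product((0, 1), repeat=n_parents):
        p1 = rng.uniform(0.1, 0.9)
        table[pa] = (1.0 - p1, p1)
    return table

pA, pB = random_cpt(0), random_cpt(0)
# Keyed by parent values: C given (A, B), D given (B, C), E given (C, D).
pC, pD, pE = random_cpt(2), random_cpt(2), random_cpt(2)

def joint(a, b, c, d, e):
    return pA[()][a] * pB[()][b] * pC[(a, b)][c] * pD[(b, c)][d] * pE[(c, d)][e]

def posterior_brute_force(d):
    """P(A | D = d) by summing the full joint over B, C and E."""
    unnorm = [sum(joint(a, b, c, d, e) for b, c, e in product((0, 1), repeat=3))
              for a in (0, 1)]
    z = sum(unnorm)
    return [u / z for u in unnorm]

def posterior_elimination(d):
    """P(A | D = d) by eliminating E, then C, then B."""
    # E drops out because sum_e P(E = e | C, D) = 1.
    # Eliminate C: phi(a, b, d) = sum_c P(c | a, b) P(d | c, b)
    phi = {(a, b): sum(pC[(a, b)][c] * pD[(b, c)][d] for c in (0, 1))
           for a, b in product((0, 1), repeat=2)}
    # Eliminate B: phi'(a, d) = sum_b P(b) phi(a, b, d)
    phi_prime = {a: sum(pB[()][b] * phi[(a, b)] for b in (0, 1)) for a in (0, 1)}
    unnorm = [pA[()][a] * phi_prime[a] for a in (0, 1)]
    z = sum(unnorm)
    return [u / z for u in unnorm]
```

Both functions should return the same distribution for d = 0 and d = 1. The elimination version never forms a table over more than three variables, whereas the brute-force version touches every one of the joint settings.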

8.2 Belief Propagation 

The belief propagation (BP) algorithm is a message passing algorithm for
computing conditional probabilities of any variable given the values of some set of
other variables in a singly-connected directed acyclic graph [36]. The algorithm
itself follows from the rules of probability and the conditional independence
properties of the graph. Whereas variable elimination focuses on finding the
conditional probability of a single variable $X_i$ given $\mathcal{X}_{\mathrm{obs}} = x_{\mathrm{obs}}$, belief
propagation can compute at once all the conditionals $p(X_i \mid \mathcal{X}_{\mathrm{obs}} = x_{\mathrm{obs}})$ for
all i not observed.

We first need to define singly-connected directed graphs. A directed graph
is singly connected if between every pair of nodes there is only one undirected
path. An undirected path is a path along the edges of the graph ignoring the
direction of the edges: in other words the path can traverse edges both upstream
and downstream. If there is more than one undirected path between any pair
of nodes then the graph is said to be multiply connected, or loopy (since it has
loops).

Singly connected graphs have an important property which BP exploits. Let
us call the set of observed variables the evidence, $e = \mathcal{X}_{\mathrm{obs}}$. Every node in
the graph divides the evidence into upstream $e^+_X$ and downstream $e^-_X$ parts.
For example, in Figure 3 the variables $U_1 \ldots U_n$, their parents, ancestors, and
children and descendants (not including X, its children and descendants) and
anything else connected to X via an edge directed toward X are all considered
to be upstream of X; anything connected to X via an edge away from X is
considered downstream of X (e.g. $Y_1$, its children, the parents of its children,
etc). Similarly, every edge $X \rightarrow Y$ in a singly connected graph divides the
evidence into upstream and downstream parts. This separation of the evidence
into upstream and downstream components does not generally occur in
multiply-connected graphs.

Fig. 3. Belief propagation in a directed graph

Belief propagation uses three key ideas to compute the probability of some
variable given the evidence $p(X \mid e)$, which we can call the "belief" about X.†
First, the belief about X can be found by combining upstream and downstream
evidence:

$$P(X \mid e) \propto P(X, e^+_X, e^-_X) \propto P(X \mid e^+_X) P(e^-_X \mid X) \qquad (33)$$

The last proportionality results from the fact that given X the downstream
and upstream evidence are conditionally independent:
$P(e^-_X \mid X, e^+_X) = P(e^-_X \mid X)$.
Second, the effect of the upstream and downstream evidence on X can be
computed via a local message passing algorithm between the nodes in the graph.
Third, the message from X to Y has to be constructed carefully so that node X
doesn't send back to Y any information that Y sent to X, otherwise the message
passing algorithm would reverberate information between nodes amplifying and
distorting the final beliefs.

Using these ideas and the basic rules of probability we can arrive at the
following equations, where ch(X) and pa(X) are children and parents of X,
respectively:

$$\lambda(X) \equiv P(e^-_X \mid X) = \prod_{j \in \mathrm{ch}(X)} P(e^-_{X Y_j} \mid X) \qquad (34)$$

$$\pi(X) \equiv P(X \mid e^+_X) = \sum_{U_1 \ldots U_n} P(X \mid U_1, \ldots, U_n) \prod_{i \in \mathrm{pa}(X)} P(U_i \mid e^+_{U_i X}) \qquad (35)$$

Finally, the messages from parents to children (e.g. X to $Y_j$) and the messages
from children to parents (e.g. X to $U_i$) can be computed as follows:

$$\pi_{Y_j}(X) \equiv P(X \mid e^+_{X Y_j}) \propto \Big[ \prod_{k \neq j} P(e^-_{X Y_k} \mid X) \Big] \sum_{U_1 \ldots U_n} P(X \mid U_1, \ldots, U_n) \prod_{i} P(U_i \mid e^+_{U_i X}) \qquad (36)$$

$$\lambda_X(U_i) \equiv P(e^-_{U_i X} \mid U_i) = \sum_{X} P(e^-_X \mid X) \sum_{U_k : k \neq i} P(X \mid U_1, \ldots, U_n) \prod_{k \neq i} P(U_k \mid e^+_{U_k X}) \qquad (37)$$

It is important to notice that in the computation of both the top-down
message (36) and the bottom-up message (37) the recipient of the message is
explicitly excluded. Pearl's [36] mnemonic of calling these messages $\lambda$ and $\pi$
messages is meant to reflect their role in computing "likelihood" and "prior" terms.

† There is considerable variety in the field regarding the naming of algorithms. Belief
propagation is also known as the sum-product algorithm, a name which some people
prefer since beliefs seem subjective.

BP includes as special cases two important algorithms: Kalman smoothing
for linear-Gaussian state-space models, and the forward-backward algorithm for
hidden Markov models. Although BP is only valid on singly connected graphs
there is a large body of research on its application to multiply connected graphs;
the use of BP on such graphs is called loopy belief propagation and has been
analyzed by several researchers [39, 40]. Interest in loopy belief propagation
arose out of its impressive performance in decoding error correcting codes [41,
42, 43, 44]. Although the beliefs are not guaranteed to be correct on loopy
graphs, interesting connections can be made to approximate inference procedures
inspired by statistical physics known as the Bethe and Kikuchi free energies [45].

8.3 Factor Graph Propagation 

In belief propagation, there is an asymmetry between the messages a child sends
its parents and the messages a parent sends its children. Propagation in
singly-connected factor graphs is conceptually much simpler and easier to implement.
In a factor graph, the joint probability distribution is written as a product of
factors. Consider a vector of variables $\mathbf{x} = (x_1, \ldots, x_n)$:

$$p(\mathbf{x}) = p(x_1, \ldots, x_n) = \frac{1}{Z} \prod_j f_j(\mathbf{x}_{S_j}) \qquad (38)$$

where Z is the normalisation constant, $S_j$ denotes the subset of $\{1, \ldots, n\}$ which
participate in factor $f_j$ and $\mathbf{x}_{S_j} = \{x_i : i \in S_j\}$.

Let n(x) denote the set of factor nodes that are neighbours of x and let
n(f) denote the set of variable nodes that are neighbours of f. We can compute
probabilities in a factor graph by propagating messages from variable nodes to
factor nodes and vice-versa. The message from variable x to function f is:

$$\mu_{x \rightarrow f}(x) = \prod_{h \in n(x) \setminus \{f\}} \mu_{h \rightarrow x}(x) \qquad (39)$$

while the message from function f to variable x is:

$$\mu_{f \rightarrow x}(x) = \sum_{\mathbf{x} \setminus x} \Big( f(\mathbf{x}) \prod_{y \in n(f) \setminus \{x\}} \mu_{y \rightarrow f}(y) \Big) \qquad (40)$$

Once a variable has received all messages from its neighbouring factor nodes
we can compute the probability of that variable by multiplying all the messages
and renormalising:

$$p(x) \propto \prod_{h \in n(x)} \mu_{h \rightarrow x}(x) \qquad (41)$$

Again, these equations can be derived by using Bayes rule and the
conditional independence relations in a singly-connected factor graph. For
multiply-connected factor graphs (where there is more than one path between at least
one pair of variable nodes) one can apply a loopy version of factor graph propagation.
Since the algorithms for directed graphs and factor graphs are essentially based
on the same ideas, we also call the loopy version of factor graph propagation
"loopy belief propagation".
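
The variable-to-factor, factor-to-variable and marginal computations of factor graph propagation translate almost directly into code for a singly-connected graph. The sketch below uses a hypothetical chain-shaped factor graph over three binary variables with made-up factor tables, and computes messages by plain recursion rather than with an explicit message schedule, which is adequate only for small trees. All names are illustrative assumptions.

```python
from itertools import product

# Hypothetical singly-connected factor graph over binary variables:
#   f1(a) - a - f2(a, b) - b - f3(b, c) - c
FACTORS = {
    "f1": (("a",), {(0,): 0.3, (1,): 0.7}),
    "f2": (("a", "b"), {(0, 0): 1.0, (0, 1): 0.5, (1, 0): 0.2, (1, 1): 2.0}),
    "f3": (("b", "c"), {(0, 0): 0.9, (0, 1): 0.1, (1, 0): 0.4, (1, 1): 1.5}),
}
VARIABLES = ("a", "b", "c")

def neighbours(var):
    """Factor nodes adjacent to a variable node."""
    return [f for f, (scope, _) in FACTORS.items() if var in scope]

def msg_var_to_factor(var, factor):
    """Product of the messages from all other neighbouring factors."""
    out = [1.0, 1.0]
    for h in neighbours(var):
        if h != factor:
            m = msg_factor_to_var(h, var)
            out = [out[v] * m[v] for v in (0, 1)]
    return out

def msg_factor_to_var(factor, var):
    """Sum out the factor's other variables, weighted by their messages."""
    scope, table = FACTORS[factor]
    incoming = {y: msg_var_to_factor(y, factor) for y in scope if y != var}
    out = [0.0, 0.0]
    for setting, value in table.items():
        term = value
        for y, v in zip(scope, setting):
            if y != var:
                term *= incoming[y][v]
        out[setting[scope.index(var)]] += term
    return out

def marginal(var):
    """Multiply all incoming factor messages and renormalise."""
    belief = [1.0, 1.0]
    for h in neighbours(var):
        m = msg_factor_to_var(h, var)
        belief = [belief[v] * m[v] for v in (0, 1)]
    z = sum(belief)
    return [b / z for b in belief]

def marginal_brute_force(var):
    """Reference answer: sum the product of all factors over every setting."""
    belief = [0.0, 0.0]
    for setting in product((0, 1), repeat=len(VARIABLES)):
        x = dict(zip(VARIABLES, setting))
        p = 1.0
        for scope, table in FACTORS.values():
            p *= table[tuple(x[v] for v in scope)]
        belief[x[var]] += p
    z = sum(belief)
    return [b / z for b in belief]
```

On this tree `marginal(v)` and `marginal_brute_force(v)` should agree for every variable. The recursion would not terminate on a graph with loops, which is where the loopy variant with iterated message updates comes in.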

8.4 Junction Tree Algorithm 

For multiply-connected graphs, the standard exact inference algorithms are based
on the notion of a junction tree [46]. The basic idea of the junction tree
algorithm is to group variables so as to convert the multiply-connected graph into a
singly-connected undirected graph (tree) over sets of variables, and do inference
in this tree.

We will not explain the algorithm in detail here, but rather give an overview
of the steps involved. Starting from a directed graph, undirected edges are
introduced between every pair of variables that share a child. This step is called
"moralisation" in a tongue-in-cheek reference to the fact that it involves
marrying the unmarried parents of every node. All the remaining edges are then
changed from directed to undirected. We now have an undirected graph which
does not imply any additional conditional or marginal independence relations
which were not present in the original directed graph (although the undirected
graph may easily have many fewer conditional or marginal independence
relations than the directed graph). The next step of the algorithm is
"triangulation" which introduces an edge cutting across every cycle of length 4 or
more. For example, the cycle A - B - C - D - A which would look like Figure 2 would
be triangulated either by adding an edge A - C or an edge B - D. Once
the graph has been triangulated, the maximal cliques of the graph are
organised into a tree, where the nodes of the tree are cliques, by placing edges
in the tree between some of the cliques with an overlap in variables (placing
edges between all overlaps may not result in a tree). In general it may be
possible to build several trees in this way, and triangulating the graph means
that there exists a tree with the "running intersection property". This
property ensures that no variable is represented in disjoint parts of the
tree, as this would cause the algorithm to come up with multiple possibly
inconsistent beliefs about the variable. Finally, once the tree with the
running intersection property is built (the junction tree) it is possible to
introduce the evidence into the tree and apply what is essentially a variant of
belief propagation to this junction tree. This BP algorithm is operating on sets
of variables contained in the cliques of the junction tree, rather than on
individual variables in the original graph. As such, the complexity of the
algorithm scales exponentially with the size of the largest clique in the junction
tree. For example, if moralisation and triangulation result in a clique
containing K binary variables, the junction tree algorithm would have to store
and manipulate tables of size $2^K$. Moreover, finding the optimal triangulation
to get the most efficient junction tree for a particular graph is NP-complete
[47, 48].
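
The moralisation step is simple enough to sketch directly. The example below applies it to the parent sets implied by the factorization in Equation (30); the function name `moralise` and the edge representation (a set of two-element frozensets) are hypothetical choices, and triangulation and tree construction are not attempted here.

```python
from itertools import combinations

# Parent sets of the directed graph, read off the factorization in Equation (30).
PARENTS = {"A": [], "B": [], "C": ["A", "B"], "D": ["B", "C"], "E": ["C", "D"]}

def moralise(parents):
    """Return the undirected edge set of the moral graph of a directed graph."""
    edges = set()
    for child, pa in parents.items():
        for p in pa:                        # keep each edge, dropping its direction
            edges.add(frozenset((p, child)))
        for p, q in combinations(pa, 2):    # marry every pair of co-parents
            edges.add(frozenset((p, q)))
    return edges

moral = moralise(PARENTS)
```

For this graph the only edge added by marrying parents is A - B (the parents of C); the pairs B, C and C, D were already connected.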

8.5 Cutset Conditioning

In certain graphs the simplest inference algorithm is cutset conditioning, which
is related to the idea of "reasoning by assumptions". The basic idea is very
straightforward: find some small set of variables such that if they were given
(i.e. you knew their values) it would make the remainder of the graph singly
connected. For example, in the undirected graph in Figure 1, given C or D, the
rest of the graph is singly connected. This set of variables is called the cutset.
For each possible value of the variables in the cutset, run BP on the remainder
of the graph to obtain the beliefs on the node of interest. These beliefs can be
averaged with appropriate weights to obtain the true belief on the variable of
interest. To make this more concrete, assume you want to find $P(X \mid e)$ and you
discover a cutset consisting of a single variable C. Then

$$P(X \mid e) = \sum_c P(X \mid C = c, e) P(C = c \mid e) \qquad (42)$$

where the beliefs $P(X \mid C = c, e)$ and corresponding weights $P(C = c \mid e)$ are
computed as part of BP, run once for each value of c.
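
The weighted average in Equation (42) can be checked numerically on a small table. This sketch only verifies the averaging identity: the conditional beliefs and weights are read off a hypothetical joint table over three binary variables (X, C, E) instead of being produced by runs of BP, and all names are illustrative.

```python
import random
from itertools import product

rng = random.Random(2)

# Hypothetical joint P(X, C, E) over three binary variables, as a normalised table.
weights = {xce: rng.uniform(0.1, 1.0) for xce in product((0, 1), repeat=3)}
z = sum(weights.values())
joint = {k: w / z for k, w in weights.items()}

def prob(x=None, c=None, e=None):
    """Probability of the given values; arguments left as None are summed out."""
    return sum(p for (xv, cv, ev), p in joint.items()
               if (x is None or xv == x)
               and (c is None or cv == c)
               and (e is None or ev == e))

def belief_direct(x, e):
    """P(X = x | E = e) computed directly from the joint."""
    return prob(x=x, e=e) / prob(e=e)

def belief_by_conditioning(x, e):
    """Sum over cutset values c of P(X = x | C = c, e) P(C = c | e)."""
    total = 0.0
    for c in (0, 1):
        p_x_given_ce = prob(x=x, c=c, e=e) / prob(c=c, e=e)
        p_c_given_e = prob(c=c, e=e) / prob(e=e)
        total += p_x_given_ce * p_c_given_e
    return total
```

In a real application each term of the sum would come from one BP run on the graph with C clamped to c, so the cost grows with the number of joint settings of the cutset.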



9 Learning in Graphical Models 

In Section 8 we described exact algorithms for inferring the value of variables in a
graph with known parameters and structure. If the parameters and structure are
unknown they can be learned from the data [49]. The learning problem can be
divided into learning the graph parameters for a known structure, and learning
the model structure (i.e. which edges should be present or absent).†

We focus here on directed graphs with discrete variables, although some of
these issues become much more subtle for undirected and factor graphs [50]. The
parameters of a directed graph with discrete variables parameterise the
conditional probability tables $P(X_i \mid X_{\mathrm{pa}_i})$. For each setting of $X_{\mathrm{pa}_i}$ this
table contains a probability distribution over $X_i$. For example, if all variables are
binary and $X_i$ has K parents, then this conditional probability table has $2^{K+1}$
entries; however, since the probability over $X_i$ has to sum to 1 for each setting of
its parents there are only $2^K$ independent entries. The most general parameterisation
would have a distinct parameter for each entry in this table, but this is often not a
natural way to parameterise the dependency between variables. Alternatives (for binary
data) are the noisy-or or sigmoid parameterisation of the dependencies [51].
Whatever the specific parameterisation, let $\theta_i$ denote the parameters relating
$X_i$ to its parents, and let $\theta$ denote all the parameters in the model. Let m
denote the model structure, which corresponds to the set of edges in the graph.
More generally the model structure can also contain the presence of additional
hidden variables [52].

† It should be noted that in Bayesian statistics there is no fundamental difference
between parameters and variables, and therefore the learning and inference problems
are really the same. All unknown quantities are treated as random variables, and
learning is just inference about parameters and structure. It is however often useful
to distinguish between parameters, which we assume to be fairly constant over the
data, and variables, which we can assume to vary over each data point.

9.1 Learning Graph Parameters 

We first consider the problem of learning graph parameters when the model 
structure is known and there are no missing or hidden variables. The presence 
of missing/hidden variables complicates the situation. 

The Complete Data Case. Assume that the parameters controlling each family
(a child and its parents) are distinct and that we observe N iid instances of
all K variables in our graph. The data set is therefore
$\mathcal{D} = \{X^{(1)}, \ldots, X^{(N)}\}$ and the likelihood can be written

$$p(\mathcal{D} \mid \theta) = \prod_{n=1}^{N} P(X^{(n)} \mid \theta) = \prod_{n=1}^{N} \prod_{i=1}^{K} P(X_i^{(n)} \mid X_{\mathrm{pa}_i}^{(n)}, \theta_i) \qquad (43)$$

Clearly, maximising the log likelihood with respect to the parameters
results in K decoupled optimisation problems, one for each family, since the
log likelihood can be written as a sum of K independent terms. Similarly,
if the prior factors over the $\theta_i$, then the Bayesian posterior is also factored:
$p(\theta \mid \mathcal{D}) = \prod_i p(\theta_i \mid \mathcal{D})$.
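
For fully observed discrete data with an unconstrained table per family, the decoupled maximum likelihood problems reduce to counting. The sketch below assumes binary variables, a hypothetical three-node graph and a four-row toy data set; `fit_cpts` is an illustrative name, and parent settings that never occur in the data are simply left out of the result.

```python
from collections import Counter

PARENTS = {"A": (), "B": (), "C": ("A", "B")}    # hypothetical three-node graph

def fit_cpts(data, parents):
    """ML estimates of P(X_i = 1 | parent setting) from complete binary data."""
    cpts = {}
    for node, pa in parents.items():     # one independent problem per family
        ones, totals = Counter(), Counter()
        for row in data:
            setting = tuple(row[p] for p in pa)
            totals[setting] += 1
            ones[setting] += row[node]
        cpts[node] = {s: ones[s] / totals[s] for s in totals}
    return cpts

# Toy complete data set: every variable is observed in every instance.
data = [
    {"A": 0, "B": 0, "C": 0},
    {"A": 0, "B": 1, "C": 1},
    {"A": 1, "B": 1, "C": 1},
    {"A": 1, "B": 1, "C": 0},
]
cpts = fit_cpts(data, PARENTS)
```

Each family is fitted using only its own columns of the data, which is the decoupling stated above. With a Dirichlet prior on each table row the same counts would feed a factored posterior, although that is not shown here.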

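The Incomplete Data Case. When there is missing/hidden data, the likelihood
no longer factors over the variables. Divide the variables in $X^{(n)}$ into
observed and missing components, $X^{(n)}_{\mathrm{obs}}$ and $X^{(n)}_{\mathrm{mis}}$.
The observed data is now
$\mathcal{D} = \{X^{(1)}_{\mathrm{obs}}, \ldots, X^{(N)}_{\mathrm{obs}}\}$ and the likelihood is:

$$p(\mathcal{D} \mid \theta) = \prod_{n=1}^{N} p(X^{(n)}_{\mathrm{obs}} \mid \theta) \tag{44}$$

$$= \prod_{n=1}^{N} \int p(X^{(n)}_{\mathrm{obs}}, X^{(n)}_{\mathrm{mis}} \mid \theta) \, dX^{(n)}_{\mathrm{mis}} \tag{45}$$

$$= \prod_{n=1}^{N} \int \prod_{i=1}^{K} p(X_i^{(n)} \mid X_{\mathrm{pa}(i)}^{(n)}, \theta_i) \, dX^{(n)}_{\mathrm{mis}} \tag{46}$$

where in the last expression the missing variables are assumed to be set to the
values $X^{(n)}_{\mathrm{mis}}$. Because of the missing data, the cost function can no longer be
written as a sum of $K$ independent terms and the parameters are all coupled.
Similarly, even if the prior factors over the $\theta_i$, the Bayesian posterior will couple
all the $\theta_i$.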

One can still optimise the likelihood by making use of the EM algorithm
(Section 3). The E step of EM infers the distribution over the hidden variables given
the current setting of the parameters. This can be done with BP for singly
connected graphs or with the junction tree algorithm for multiply-connected graphs.
In the M step, the objective function being optimised conveniently factors in
exactly the same way as in the complete data case (c.f. Equation (21)). Whereas
for the complete data case, the optimal ML parameters can often be computed
in closed form, in the incomplete data case an iterative algorithm such as EM is
usually required.

Bayesian parameter inference in the incomplete data case is also substantially 
more complicated. The parameters and missing data are coupled in the posterior 
distribution, as can be seen by multiplying (45) by the parameter prior and 
normalising. Inference can be achieved via approximate inference methods such 
as Markov chain Monte Carlo methods (Section 11.3, [53]) like Gibbs sampling, 
and variational approximations (Section 11.4, [54]). 

9.2 Learning Graph Structure 

There are two basic components to learning the structure of a graph from data:
scoring and search. Scoring refers to computing a measure which can be used
to compare different structures $m$ and $m'$ given a data set $\mathcal{D}$. Search refers
to searching over the space of possible model structures, usually by proposing
changes to the current model, so as to find the model with the highest score. This
view of structure learning presupposes that the goal is to find a single structure
with the highest score, although of course in the Bayesian inference framework
it is desirable to infer the probability distribution over model structures given
the data.

Scoring Metrics. Assume that you have a prior $P(m)$ over model structures,
which is ideally based on some domain knowledge. The natural score to use is the
probability of the model given the data (although see [55]) or some monotonic
function of this:

$$s(m, \mathcal{D}) = P(m \mid \mathcal{D}) \propto P(\mathcal{D} \mid m) P(m). \tag{47}$$

This score requires computing the marginal likelihood:

$$P(\mathcal{D} \mid m) = \int P(\mathcal{D} \mid \theta, m) P(\theta \mid m) \, d\theta. \tag{48}$$

We discuss the intuitions behind the marginal likelihood as a natural score 
for model comparison in Section 10. 

For directed graphical models with fully-observed discrete variables and fac- 
tored Dirichlet priors over the parameters of the conditional probability tables, 
the integral in (48) is analytically tractable. For models with missing/hidden 
data, alternative choices of priors and types of variables, the integral in (48) is 
often intractable and approximation methods are required. Some of the standard 
approximations that can be applied in this context and many other Bayesian in- 
ference problems are briefly reviewed in Section 11. 
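As a small worked case of the tractable setting, consider a single discrete variable with a Dirichlet prior. The integral in (48) then reduces to a ratio of Gamma functions, a standard result. The sketch below evaluates it in log space; the function name and the example counts are illustrative assumptions.

```python
from math import exp, lgamma


def log_marginal_dirichlet(counts, alpha):
    """Log marginal likelihood of an observed sequence of discrete outcomes.

    counts: number of times each state was observed
    alpha:  Dirichlet prior parameters, one per state
    Computes log[ G(a0) / G(a0 + N) * prod_k G(alpha_k + n_k) / G(alpha_k) ],
    where G is the Gamma function, a0 = sum(alpha) and N = sum(counts).
    """
    a0 = sum(alpha)
    n = sum(counts)
    result = lgamma(a0) - lgamma(a0 + n)
    for n_k, a_k in zip(counts, alpha):
        result += lgamma(a_k + n_k) - lgamma(a_k)
    return result


# One head and one tail under a uniform prior on the coin bias:
# the integral of theta * (1 - theta) over [0, 1] is 1/6.
evidence = exp(log_marginal_dirichlet([1, 1], [1.0, 1.0]))
```

For a directed model with factored Dirichlet priors, the log score would be a sum of such terms, one for each family and each configuration of the parents.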

Search Algorithms. Given a way of scoring models, one can search over the
space of all possible valid graphical models for the one with the highest score [56].
The space of all possible graphs is very large (exponential in the number of
variables) and for directed graphs it can be expensive to check whether a particular
change to the graph will result in a cycle being formed. Thus intelligent heuristics
are needed to search the space efficiently [57]. An alternative to trying to find
the most probable graph is to use methods that sample from the posterior distribution
over graphs [58]. This has the advantage that it avoids the problem of overfitting
which can occur for algorithms that select a single structure with highest score
out of exponentially many.
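The cycle check mentioned above can be sketched as a reachability test: adding an edge u -> v to a directed acyclic graph creates a cycle exactly when u is already reachable from v. The function name and the edge-list representation are assumptions made for illustration; practical structure-search code typically caches ancestor sets to avoid repeating this traversal for every proposed change.

```python
def creates_cycle(edges, new_edge):
    """Return True if adding new_edge = (u, v) to the DAG would form a cycle.

    edges: iterable of (parent, child) pairs describing an acyclic graph.
    """
    u, v = new_edge
    children = {}
    for a, b in edges:
        children.setdefault(a, []).append(b)
    # Depth-first search from v; reaching u means a path v -> ... -> u exists,
    # which the new edge u -> v would close into a cycle.
    stack, seen = [v], set()
    while stack:
        node = stack.pop()
        if node == u:
            return True
        if node in seen:
            continue
        seen.add(node)
        stack.extend(children.get(node, []))
    return False


edges = [("a", "b"), ("b", "c")]
closes_loop = creates_cycle(edges, ("c", "a"))
is_safe = not creates_cycle(edges, ("a", "c"))
```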



10 Bayesian Model Comparison and Occam’s Razor 

So far in this chapter we have seen many different kinds of models. One of the 
most important problems in unsupervised learning is automatically determining 
which models are appropriate for a given data set. Model selection and compar- 
ison questions include all of the following: 

— Are there clusters in the data and if so, how many? What are their shapes (e.g. 
Gaussian, t-distributed)? 

— Does the data live on a low dimensional manifold? What dimensionality? Is 
this manifold flat or curved? 

— Is the data discretised? If so, to what precision?

— Is the data a time series? If so, is it better modelled by an HMM, a state- 
space model? Linear or nonlinear? Gaussian or non-Gaussian noise? How 
many states should the HMM have? How many state variables should the 
SSM have? 

— Can the data be modelled well by a directed graph? What is the structure of
this graph? Does it have hidden variables? Are these continuous or discrete?

Clearly, this list could go on. A human may be able to answer these questions
via careful use of visualisation, hypothesis testing, and guesswork. But
ultimately, an intelligent unsupervised learning system should be able to answer
all these questions automatically.

Fortunately, the framework of Bayesian inference can be used to provide a
rational, coherent and automatic way of answering all of the above questions.
This means that there is an automatic procedure (based on Bayes rule) which
provides a unique answer. Of course, as always, if the prior assumptions are
very poor, the answers obtained could be useless. Therefore, it is essential to
think carefully about the prior assumptions before turning the automatic
Bayesian handle.

Let us go over this automatic procedure. Consider a model $m_i$ coming from
a set of possible models $\{m_1, m_2, m_3, \ldots\}$. For instance, the model $m_i$ might
correspond to a Gaussian mixture model with $i$ components. The models need
not be nested, nor does the space of models need to be discrete (although we’ll
focus on that case). Given data $\mathcal{D}$, the natural way to compare models is via
their probability:

$$P(m_i \mid \mathcal{D}) = \frac{P(\mathcal{D} \mid m_i) P(m_i)}{P(\mathcal{D})} \tag{49}$$



To compare models, the denominator, which sums over the potentially huge
space of all possible models, $P(\mathcal{D}) = \sum_j P(\mathcal{D} \mid m_j) P(m_j)$, is not required. Prior
preference for models can be included in $P(m_i)$. However, it is interesting to look
closely at the marginal likelihood term $P(\mathcal{D} \mid m_i)$ (sometimes called the
evidence for model $m_i$). Assume that model $m_i$ has parameters $\theta_i$ (e.g. the means
and covariance matrices of the $i$ Gaussians, along with the mixing proportions,
c.f. Section 2.4). The marginal likelihood integrates over all possible parameter values

$$P(\mathcal{D} \mid m_i) = \int P(\mathcal{D} \mid \theta_i, m_i) P(\theta_i \mid m_i) \, d\theta_i \tag{50}$$

where $P(\theta_i \mid m_i)$ is the prior over parameters, which is required for a complete
specification of the model $m_i$.

The marginal likelihood has a very interesting interpretation. It is the
probability of generating data set $\mathcal{D}$ from parameters that are randomly sampled
from the prior for $m_i$. This should be contrasted with the maximum likelihood
for $m_i$ which is the probability of the data under the single setting of the
parameters $\hat{\theta}_i$ that maximises $P(\mathcal{D} \mid \theta_i, m_i)$. Clearly a more complicated model will have
a higher maximum likelihood, which is the reason why maximising the likelihood
results in overfitting, i.e. a preference for more complicated models than
necessary. In contrast, the marginal likelihood can decrease as the model becomes
more complicated. In a more complicated model sampling random parameter
values can generate a wider range of possible data sets, but since the probability
over data sets has to integrate to 1 (assuming a fixed number of data points)
spreading the density to allow for more complicated data sets necessarily results
in some simpler data sets having lower density under the model. This situation
is diagrammed in Figure 4. The decrease in the marginal likelihood as additional
parameters are added has been called the automatic Occam’s Razor [59, 60, 61].
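A toy comparison makes the effect concrete. The sketch below, under assumptions chosen purely for illustration, compares two models of a sequence of coin flips: a fair coin with no free parameters, and a coin with unknown bias under a uniform prior. The more flexible model always attains at least as high a maximum likelihood, yet its marginal likelihood is lower when the data look fair.

```python
from math import exp, lgamma, log


def log_evidence_fair(heads, tails):
    """Log marginal likelihood of a specific flip sequence under a fair coin."""
    return (heads + tails) * log(0.5)


def log_evidence_unknown_bias(heads, tails):
    """Log marginal likelihood with a uniform prior on the bias.

    The integral of theta^h (1 - theta)^t over [0, 1] is the Beta function
    B(h + 1, t + 1).
    """
    return lgamma(heads + 1) + lgamma(tails + 1) - lgamma(heads + tails + 2)


def max_log_likelihood_unknown_bias(heads, tails):
    """Log likelihood at the maximum likelihood bias heads / (heads + tails)."""
    n = heads + tails
    result = 0.0
    if heads:
        result += heads * log(heads / n)
    if tails:
        result += tails * log(tails / n)
    return result


# Balanced data: the simple model has the higher evidence.
balanced = (log_evidence_fair(5, 5), log_evidence_unknown_bias(5, 5))
# Skewed data: the flexible model has the higher evidence.
skewed = (log_evidence_fair(9, 1), log_evidence_unknown_bias(9, 1))
```

With 5 heads and 5 tails the fair coin is preferred even though the flexible model fits the data equally well at its best parameter setting; with 9 heads and 1 tail the preference reverses.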

In theory all the questions posed at the beginning of this section could be 
addressed by defining appropriate priors and carefully computing marginal like- 
lihoods of competing hypotheses. However, in practice the integral in (50) is 
usually very high dimensional and intractable. It is therefore necessary to ap- 
proximate it. 



11 Approximating Posteriors and Marginal Likelihoods 

There are many ways of approximating the marginal likelihood of a model, and 
the corresponding parameter posterior. In this section, we review some of the 
most frequently used methods. 

Fig. 4. The marginal likelihood (evidence) as a function of an abstract one dimensional
representation of “all possible” data sets of some size N. Because the evidence is a 
probability over data sets, it must normalise to one. Therefore very complex models 
which can account for many datasets only achieve modest evidence; simple models can 
reach high evidences, but only for a limited set of data. When a dataset D is observed, 
the evidence can be used to select between model complexities 



11.1 Laplace Approximation 

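It can be shown that under some regularity conditions, for large amounts of data
$N$ relative to the number of parameters in the model, $d$, the parameter posterior
is approximately Gaussian around the MAP estimate, $\hat{\theta}$:

$$p(\theta \mid \mathcal{D}, m) \approx (2\pi)^{-d/2} |A|^{1/2} \exp\left\{ -\tfrac{1}{2} (\theta - \hat{\theta})^{\top} A (\theta - \hat{\theta}) \right\} \tag{51}$$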



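Here $A$ is the $d \times d$ negative of the Hessian matrix, which measures the
curvature of the log posterior at the MAP estimate:

$$A_{ij} = - \left. \frac{\partial^2}{\partial \theta_i \, \partial \theta_j} \log p(\theta \mid \mathcal{D}, m) \right|_{\theta = \hat{\theta}} \tag{52}$$

The matrix $A$ is also referred to as the observed information matrix.
Equation (51) is the Laplace approximation to the parameter posterior.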

By Bayes rule, the marginal likelihood satisfies the following equality at any $\theta$:

$$p(\mathcal{D} \mid m) = \frac{p(\theta, \mathcal{D} \mid m)}{p(\theta \mid \mathcal{D}, m)} \tag{53}$$

The Laplace approximation to the marginal likelihood can be derived by
evaluating the log of this expression at $\hat{\theta}$, using the Gaussian approximation to
the posterior from equation (51) in the denominator:

$$\log p(\mathcal{D} \mid m) \approx \log p(\hat{\theta} \mid m) + \log p(\mathcal{D} \mid \hat{\theta}, m) + \frac{d}{2} \log 2\pi - \frac{1}{2} \log |A| \tag{54}$$
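To see how the approximation in (54) behaves, it can be checked on a model where the marginal likelihood is known exactly. The sketch below assumes a coin-flip model with a Beta(a, b) prior on the bias, so that d = 1 and the exact answer is a ratio of Beta functions. The model, the function names and the data (30 heads, 20 tails) are illustrative assumptions rather than an example from the text.

```python
from math import lgamma, log, pi


def log_beta(a, b):
    """Log of the Beta function B(a, b)."""
    return lgamma(a) + lgamma(b) - lgamma(a + b)


def exact_log_evidence(heads, tails, a, b):
    """Exact log marginal likelihood of a flip sequence under a Beta(a, b) prior."""
    return log_beta(a + heads, b + tails) - log_beta(a, b)


def laplace_log_evidence(heads, tails, a, b):
    """Laplace approximation (54) for the same model, with d = 1.

    Assumes a + heads > 1 and b + tails > 1 so the MAP estimate is interior.
    """
    theta = (a + heads - 1) / (a + b + heads + tails - 2)  # MAP estimate
    log_prior = (a - 1) * log(theta) + (b - 1) * log(1 - theta) - log_beta(a, b)
    log_lik = heads * log(theta) + tails * log(1 - theta)
    # Negative second derivative of the log posterior at the MAP estimate.
    curvature = (a + heads - 1) / theta**2 + (b + tails - 1) / (1 - theta) ** 2
    return log_prior + log_lik + 0.5 * log(2 * pi) - 0.5 * log(curvature)


exact = exact_log_evidence(30, 20, 2.0, 2.0)
approx = laplace_log_evidence(30, 20, 2.0, 2.0)
```

With 50 observations and a single parameter the two values agree closely, consistent with the large-N regime in which the Gaussian approximation is justified.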

11.2 The Bayesian Information Criterion (BIC) 

One of the disadvantages of the Laplace approximation is that it requires com- 
puting the determinant of the Hessian matrix. For models with many parame- 
ters, the Hessian matrix can be very large, and computing its determinant can 
be prohibitive. 

The Bayesian Information Criterion (BIC) is a quick and easy way to compute
an approximation to the marginal likelihood. BIC can be derived from the
Laplace approximation by dropping all terms that do not depend on $N$, the
number of data points. Starting from equation (54), we note that the first and
third terms are constant with respect to the number of data points. Referring
to the definition of the Hessian, we can see that its elements grow linearly with
$N$. In the limit of large $N$ we can therefore write $A = N \hat{A}$, where $\hat{A}$ is a matrix
independent of $N$. We use the fact that for any scalar $c$ and $d \times d$ matrix $P$, the
determinant $|cP| = c^d |P|$, to get

$$\frac{1}{2} \log |A| \approx \frac{d}{2} \log N + \frac{1}{2} \log |\hat{A}| \tag{55}$$

The last term does not grow with $N$, so by dropping it and substituting into
Eq. (54) we get the BIC approximation:

$$\log p(\mathcal{D} \mid m) \approx \log p(\mathcal{D} \mid \hat{\theta}, m) - \frac{d}{2} \log N \tag{56}$$

This expression is extremely easy to compute. Since the expression does not
involve the prior it can be used either when $\hat{\theta}$ is the MAP or the ML parameter
estimate, the latter choice making the entire procedure independent of a prior.
The likelihood is penalised by a term that depends linearly on the number of
parameters in the model; this term is referred to as the BIC penalty. This is how
BIC approximates the Bayesian Occam’s Razor effect which penalises overcomplex
models. The BIC criterion can also be derived from within the Minimum
Description Length (MDL) framework.

The BIC penalty is clearly attractive since it does not require any costly 
integrals or matrix inversions. However this simplicity comes at a cost in accuracy 
which can sometimes be catastrophic. One of the dangers of BIC is that it relies 
on the number of parameters. The basic assumption underlying BIC, that the 
Hessian converges to N times a full-rank matrix, only holds for models in which 
all parameters are identifiable and well-determined. This is often not true. 
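Equation (56) translates directly into code. The sketch below applies it to a coin-flip model with one free parameter and compares the result with the exact log marginal likelihood under a uniform prior on the bias; the model and the data (30 heads, 20 tails) are assumptions chosen because the exact answer is available in closed form.

```python
from math import lgamma, log


def bic_log_evidence(max_log_lik, num_params, num_points):
    """BIC approximation (56): log p(D|m) ~ log p(D|theta_hat, m) - (d/2) log N."""
    return max_log_lik - 0.5 * num_params * log(num_points)


heads, tails = 30, 20
n = heads + tails
theta_ml = heads / n  # maximum likelihood estimate of the bias
max_log_lik = heads * log(theta_ml) + tails * log(1 - theta_ml)

bic = bic_log_evidence(max_log_lik, num_params=1, num_points=n)
# Exact value under a uniform prior: log B(heads + 1, tails + 1).
exact = lgamma(heads + 1) + lgamma(tails + 1) - lgamma(n + 2)
```

In this well-behaved one-parameter case the two numbers differ by a fraction of a nat. The caveats above apply when parameters are not identifiable, where counting parameters can misstate the effective complexity.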

11.3 Markov Chain Monte Carlo (MCMC) 

Monte Carlo methods are a standard and often extremely effective way of
computing complicated high dimensional integrals and sums. Many Bayesian
inference problems can be seen as computing the integral (or sum) of some function
$f(\theta)$ under some probability density $p(\theta)$:

$$\bar{f} = \int f(\theta) p(\theta) \, d\theta. \tag{57}$$

For example, the marginal likelihood is the integral of the likelihood
function under the prior. Simple Monte Carlo approximates (57) by sampling $M$
independent draws $\theta_i \sim p(\theta)$ and computing the sample average of $f$:

$$\bar{f} \approx \frac{1}{M} \sum_{i=1}^{M} f(\theta_i). \tag{58}$$

There are many limitations of simple Monte Carlo; for example, it is often not
possible to draw directly from $p$. Generalisations of simple Monte Carlo such as
rejection sampling and importance sampling attempt to overcome some of these
limitations.
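As an illustration of (58), the marginal likelihood of a coin-flip sequence under a uniform prior on the bias can be estimated by averaging the likelihood over draws from the prior. The model, the function name and the fixed seed are assumptions made for this sketch; the exact value B(h + 1, t + 1) is available for comparison.

```python
import random


def mc_evidence(heads, tails, num_samples, seed=0):
    """Simple Monte Carlo estimate of the marginal likelihood.

    Draws theta from the uniform prior on [0, 1) and averages the likelihood
    theta^heads * (1 - theta)^tails over the draws.
    """
    rng = random.Random(seed)
    total = 0.0
    for _ in range(num_samples):
        theta = rng.random()
        total += theta**heads * (1.0 - theta) ** tails
    return total / num_samples


estimate = mc_evidence(heads=9, tails=1, num_samples=20000)
exact = 1.0 / 110.0  # B(10, 2) = 9! * 1! / 11!
```

Sampling from the prior works here because the likelihood is not sharply peaked. With many data points most prior draws would land where the likelihood is negligible and the estimate would have high variance.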

An important family of generalisations of Monte Carlo methods are Markov
chain Monte Carlo (MCMC) methods. These are commonly used and powerful
methods for approximating the posterior over parameters and the marginal
likelihood. Unlike simple Monte Carlo methods, the samples are not drawn
independently but rather in the form of a Markov chain
$\ldots \to \theta_t \to \theta_{t+1} \to \theta_{t+2} \to \ldots$ where each sample depends on the value of the
previous sample. MCMC estimates have the property that the asymptotic
distribution of $\theta_t$ is the desired distribution. That is,
$\lim_{t \to \infty} p_t(\theta_t) = p(\theta)$. Creating MCMC methods is somewhat of an art, and
there are many MCMC methods available, some of which are reviewed in [53].
Some notable examples are Gibbs sampling, the Metropolis algorithm, and Hybrid
Monte Carlo.
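A minimal sketch of one of these methods, the Metropolis algorithm with a Gaussian random-walk proposal, is given below. The target is an assumed coin-bias posterior (9 heads, 1 tail, uniform prior), chosen because its mean 10/12 is known; the step size, chain length and burn-in are arbitrary illustrative choices.

```python
import math
import random


def metropolis(log_target, x0, num_steps, step_size, seed=0):
    """Random-walk Metropolis sampler for an unnormalised log density."""
    rng = random.Random(seed)
    x, log_p = x0, log_target(x0)
    chain = []
    for _ in range(num_steps):
        proposal = x + rng.gauss(0.0, step_size)
        log_p_new = log_target(proposal)
        # Accept with probability min(1, p(proposal) / p(x)).
        if rng.random() < math.exp(min(0.0, log_p_new - log_p)):
            x, log_p = proposal, log_p_new
        chain.append(x)  # a rejected proposal repeats the current state
    return chain


def log_posterior(theta, heads=9, tails=1):
    """Unnormalised log posterior of a coin bias under a uniform prior."""
    if not 0.0 < theta < 1.0:
        return -math.inf
    return heads * math.log(theta) + tails * math.log(1.0 - theta)


chain = metropolis(log_posterior, x0=0.5, num_steps=20000, step_size=0.2)
samples = chain[2000:]  # discard an initial burn-in period
posterior_mean = sum(samples) / len(samples)
```

Only ratios of the target density are needed, so the normalising constant (here the marginal likelihood) never has to be computed to draw samples.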

11.4 Variational Approximations 

Variational methods can be used to derive a family of lower bounds on the
marginal likelihood and to perform approximate Bayesian inference over the
parameters of probabilistic models [62, 63, 64]. Variational methods provide
an alternative to the asymptotic and sampling-based approximations described
above; they tend to be more accurate than the asymptotic approximations like
BIC and faster than the MCMC approaches.

Let $y$ denote the observed variables, $x$ denote the latent variables, and $\theta$
denote the parameters. The log marginal likelihood of data $y$ can be lower bounded
by introducing any distribution over both latent variables and parameters which
has support where $p(x, \theta \mid y, m)$ does, and then appealing to Jensen’s inequality
(due to the concavity of the logarithm function):

$$\ln p(y \mid m) = \ln \int p(y, x, \theta \mid m) \, dx \, d\theta = \ln \int q(x, \theta) \frac{p(y, x, \theta \mid m)}{q(x, \theta)} \, dx \, d\theta \tag{59}$$

$$\geq \int q(x, \theta) \ln \frac{p(y, x, \theta \mid m)}{q(x, \theta)} \, dx \, d\theta. \tag{60}$$

Maximising this lower bound with respect to the free distribution $q(x, \theta)$
results in $q(x, \theta) = p(x, \theta \mid y, m)$ which when substituted above turns the inequality
into an equality (c.f. Section 3). This does not simplify the problem since
evaluating the true posterior distribution $p(x, \theta \mid y, m)$ requires knowing its normalising
constant, the marginal likelihood. Instead we use a simpler, factorised
approximation $q(x, \theta) = q_x(x) q_\theta(\theta)$:

$$\ln p(y \mid m) \geq \int q_x(x) q_\theta(\theta) \ln \frac{p(y, x, \theta \mid m)}{q_x(x) q_\theta(\theta)} \, dx \, d\theta \stackrel{\mathrm{def}}{=} F_m(q_x(x), q_\theta(\theta), y). \tag{61}$$

The quantity $F_m$ is a functional of the free distributions, $q_x(x)$ and $q_\theta(\theta)$.
The variational Bayesian algorithm iteratively maximises $F_m$ in equation (61)
with respect to the free distributions, $q_x(x)$ and $q_\theta(\theta)$. We use elementary
calculus of variations to take functional derivatives of the lower bound with respect to
$q_x(x)$ and $q_\theta(\theta)$, each while holding the other fixed. This results in the following
update equations where the superscript $(t)$ denotes the iteration number:

$$q_x^{(t+1)}(x) \propto \exp\left[ \int \ln p(x, y \mid \theta, m) \, q_\theta^{(t)}(\theta) \, d\theta \right] \tag{62}$$

$$q_\theta^{(t+1)}(\theta) \propto p(\theta \mid m) \exp\left[ \int \ln p(x, y \mid \theta, m) \, q_x^{(t+1)}(x) \, dx \right] \tag{63}$$



When there is more than one data point then there are different hidden
variables $x_i$ associated with each data point $y_i$ and the step in (62) has to be
carried out for each $i$, where the distributions are $q_{x_i}^{(t)}(x_i)$.

Clearly $q_\theta(\theta)$ and $q_{x_i}(x_i)$ are coupled, so we iterate these equations until
convergence. Recalling the EM algorithm (Section 3 and [65, 66]) we note the
similarity between EM and the iterative algorithm in (62) and (63). This
procedure is called the Variational Bayesian EM algorithm and generalises the usual
EM algorithm; see also [67] and [68].

Re-writing (61), it is easy to see that maximising $F_m$ is equivalent to
minimising the KL divergence between $q_x(x) q_\theta(\theta)$ and the joint posterior $p(x, \theta \mid y, m)$:

$$\ln p(y \mid m) - F_m(q_x(x), q_\theta(\theta), y) = \int q_x(x) q_\theta(\theta) \ln \frac{q_x(x) q_\theta(\theta)}{p(x, \theta \mid y, m)} \, dx \, d\theta = \mathrm{KL}(q \| p) \tag{64}$$

Note that while this factorisation of the posterior distribution over latent
variables and parameters may seem drastic, one can think of it as replacing
stochastic dependencies between $x$ and $\theta$ with deterministic dependencies
between relevant moments of the two sets of variables. To compare between models
$m$ and $m'$ one can evaluate $F_m$ and $F_{m'}$. This approach can, for example, be
used to score graphical model structures [54].

Summarising, the variational Bayesian EM algorithm simultaneously com- 
putes an approximation to the marginal likelihood and to the parameter poste- 
rior by maximising a lower bound. 
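The relationship between the lower bound, the KL divergence and the log marginal likelihood can be checked numerically on a toy problem. The sketch below assumes a binary latent variable and a binary "parameter", with a hypothetical table of joint probabilities p(y, x, theta) for one fixed observation y, so that all integrals become sums. The prior is absorbed into the joint table, so the parameter update takes the same form as the latent-variable update.

```python
import math

# Hypothetical joint p(y, x, theta) for one fixed y: rows index x, columns theta.
JOINT = [[0.30, 0.05],
         [0.05, 0.20]]
LOG_EVIDENCE = math.log(sum(sum(row) for row in JOINT))  # ln p(y)


def lower_bound(qx, qt):
    """F = sum_{x,theta} q_x q_theta ln[ p(y, x, theta) / (q_x q_theta) ]."""
    return sum(qx[i] * qt[j] * math.log(JOINT[i][j] / (qx[i] * qt[j]))
               for i in range(2) for j in range(2))


def kl_to_posterior(qx, qt):
    """KL(q || p(x, theta | y)) for the factorised q."""
    evidence = math.exp(LOG_EVIDENCE)
    return sum(qx[i] * qt[j] * math.log(qx[i] * qt[j] * evidence / JOINT[i][j])
               for i in range(2) for j in range(2))


def normalise(weights):
    total = sum(weights)
    return [w / total for w in weights]


def update_qx(qt):
    """q_x(x) proportional to exp( E_{q_theta}[ ln p(y, x, theta) ] )."""
    return normalise([math.exp(sum(qt[j] * math.log(JOINT[i][j]) for j in range(2)))
                      for i in range(2)])


def update_qt(qx):
    """q_theta(theta) proportional to exp( E_{q_x}[ ln p(y, x, theta) ] )."""
    return normalise([math.exp(sum(qx[i] * math.log(JOINT[i][j]) for i in range(2)))
                      for j in range(2)])


qx, qt = [0.6, 0.4], [0.5, 0.5]
bounds = [lower_bound(qx, qt)]
for _ in range(20):
    qx = update_qx(qt)
    qt = update_qt(qx)
    bounds.append(lower_bound(qx, qt))
gap = LOG_EVIDENCE - bounds[-1]
```

The recorded bounds never decrease and never exceed ln p(y), and the remaining gap equals the KL divergence between the factorised approximation and the true posterior, as in (64).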



11.5 Expectation Propagation (EP) 

Expectation propagation (EP; [23, 69]) is another powerful method for
approximate Bayesian inference. Consider a Bayesian inference problem in which you
are given iid data $\mathcal{D} = \{x^{(1)}, \ldots, x^{(N)}\}$ assumed to have come from a model
$p(x \mid \theta)$ parameterised by $\theta$ with prior $p(\theta)$. The parameter posterior is:

$$p(\theta \mid \mathcal{D}) = \frac{1}{p(\mathcal{D})} \, p(\theta) \prod_{i=1}^{N} p(x^{(i)} \mid \theta). \tag{65}$$

To make the notation more general we can write the quantity we wish to
approximate as a product of factors over $\theta$,

$$\prod_{i=0}^{N} f_i(\theta) = p(\theta) \prod_{i=1}^{N} p(x^{(i)} \mid \theta) \tag{66}$$

where $f_0(\theta) = p(\theta)$ and $f_i(\theta) = p(x^{(i)} \mid \theta)$ and we will ignore the normalising
constants. We wish to approximate this by a product of terms

$$q(\theta) = \prod_{i=0}^{N} \tilde{f}_i(\theta). \tag{67}$$



For example, consider a binary linear classification problem where $\theta$ are the
parameters of the classification hyperplane and $p(\theta)$ is a Gaussian prior ([69],
Chapter 5). The true posterior is the product of this Gaussian and $N$ likelihood
terms, each of which defines a half-plane consistent with the class label observed.
This posterior has a complicated shape, but we can approximate it using EP by
assuming that each of the approximate likelihood terms $\tilde{f}_i$ is Gaussian in $\theta$. Since
the product of Gaussians is Gaussian, $q(\theta)$ will be a Gaussian approximation
to the posterior. In general, one makes the approximate terms $\tilde{f}_i$ belong to
some exponential family distribution so the overall approximation is in the same
exponential family.

Having decided on the form of the approximation (67), let us consider how
to tune this approximation so as to make it as accurate as possible. Ideally we
would like to minimise the KL divergence between the true and the approximate
distributions:

$$\min_{q(\theta)} \mathrm{KL}\left( p(\theta \mid \mathcal{D}) \,\middle\|\, q(\theta) \right) \tag{68}$$



For example, if $q(\theta)$ is a Gaussian density, minimising this KL divergence will
result in finding the mean and covariance of the true posterior distribution
over parameters. Unfortunately, this KL divergence involves averaging with
respect to the true posterior distribution, which will generally be intractable. Note
that the KL divergence in Equation (68) is different from the KL minimised by
variational Bayesian methods (64); the former averages with respect to the true
distribution and is therefore usually intractable, while the latter averages with
respect to the approximate distribution and is often tractable. Moreover, for
exponential family approximations the former KL has a unique global optimum,
while the latter usually has multiple local optima.

Since we cannot minimise (68) we can instead consider minimising the KL
divergence between each true term and the corresponding approximate term.
That is, for each $i$:

$$\min_{\tilde{f}_i(\theta)} \mathrm{KL}\left( f_i(\theta) \,\middle\|\, \tilde{f}_i(\theta) \right). \tag{69}$$



This will usually be much easier to do, but each such approximate term 
will result in some error. Multiplying all the approximate terms together will 
probably result in an unacceptably inaccurate approximation. On the plus side, 
this approach is non-iterative in that once each term is approximated they are 
simply multiplied to get a final answer. 

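The Expectation Propagation (EP) algorithm is an iterative procedure which
is as easy as the naive approach in (69) but which results in a much more accurate
approximation. At each step of EP, one of the terms is optimised in the context
of all the other approximate terms, i.e. for each $i$:

$$\min_{\tilde{f}_i(\theta)} \mathrm{KL}\left( f_i(\theta) \prod_{j \neq i} \tilde{f}_j(\theta) \,\middle\|\, \tilde{f}_i(\theta) \prod_{j \neq i} \tilde{f}_j(\theta) \right). \tag{70}$$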



Since the approximate terms depend on each other, this procedure is iterated.
On the left hand side of the KL divergence the exact term $f_i(\theta)$ is incorporated
into $\prod_{j \neq i} \tilde{f}_j(\theta)$, which is assumed to be in the exponential family. The right
hand side is an exponential-family approximation to this whole product. The
minimisation is done by matching the appropriate moments (expectations) of
$f_i(\theta) \prod_{j \neq i} \tilde{f}_j(\theta)$. The name “Expectation Propagation” comes from the fact that
each step corresponds to computing certain expectations, and the effect of these
expectations is propagated to subsequent terms in the approximation. In fact,
the messages in belief propagation can be derived as a particular form of EP
where the approximating distribution is assumed to be a fully factored product
of marginals over the variables in $\theta$, i.e. $q(\theta) = \prod_k q_k(\theta_k)$ [69].
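The moment-matching step can be illustrated in isolation. The sketch below takes a one-dimensional target density (a hypothetical two-component Gaussian mixture standing in for the product of an exact term with the other approximate terms), computes its mean and variance by numerical integration on a grid, and evaluates the KL divergence from the target to Gaussians with these and with perturbed moments. The target, grid and perturbations are all illustrative assumptions.

```python
import math


def normal_pdf(x, mean, var):
    return math.exp(-0.5 * (x - mean) ** 2 / var) / math.sqrt(2.0 * math.pi * var)


def target_pdf(x):
    """Hypothetical non-Gaussian target: a two-component Gaussian mixture."""
    return 0.3 * normal_pdf(x, -2.0, 1.0) + 0.7 * normal_pdf(x, 2.0, 0.5)


DX = 0.01
GRID = [-12.0 + DX * k for k in range(2401)]  # covers [-12, 12]


def kl_target_to_gaussian(mean, var):
    """KL(target || N(mean, var)), approximated by a Riemann sum on GRID."""
    total = 0.0
    for x in GRID:
        p = target_pdf(x)
        if p > 0.0:
            total += p * math.log(p / normal_pdf(x, mean, var)) * DX
    return total


# Moments of the target (analytically: mean 0.8, variance 4.01).
mean = sum(x * target_pdf(x) * DX for x in GRID)
var = sum((x - mean) ** 2 * target_pdf(x) * DX for x in GRID)

kl_matched = kl_target_to_gaussian(mean, var)
kl_shifted = kl_target_to_gaussian(mean + 0.5, var)
kl_wider = kl_target_to_gaussian(mean, 1.5 * var)
kl_narrower = kl_target_to_gaussian(mean, 0.7 * var)
```

The Gaussian with matched moments gives the smallest divergence of the four. In EP proper the same matching is applied to each term in turn, in the context of the remaining approximate terms.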

In its simplest form, the EP algorithm can be summarised as in Figure 5. 
Although the algorithm as described here often converges, each step of the algo- 
rithm is not in fact decreasing any objective function so there is no guarantee of 
convergence. Convergent forms of EP can be derived by making use of the EP 
energy function [70] although these may not be as fast and simple to implement 
as the algorithm in Figure 5. 



12 Conclusion 

In this chapter, we have seen that unsupervised learning can be viewed from the 
perspective of statistical modelling. Statistics provides a coherent framework for 
learning from data and for reasoning under uncertainty. Many interesting statis- 
tical models used for unsupervised learning can be cast as latent variable models 
and graphical models. These types of models have played an important role in 
defining unsupervised learning systems for a variety of different kinds of data. 
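
Input $f_0(\theta) \ldots f_N(\theta)$
Initialise $\tilde{f}_0(\theta) = f_0(\theta)$, $\tilde{f}_i(\theta) = 1$ for $i > 0$, $q(\theta) = \prod_i \tilde{f}_i(\theta)$
repeat
    for $i = 0 \ldots N$ do
        Deletion: $q_{\setminus i}(\theta) \leftarrow q(\theta) / \tilde{f}_i(\theta) = \prod_{j \neq i} \tilde{f}_j(\theta)$
        Projection: $\tilde{f}_i^{\mathrm{new}}(\theta) \leftarrow \arg\min_{f(\theta)} \mathrm{KL}\left( f_i(\theta) q_{\setminus i}(\theta) \,\|\, f(\theta) q_{\setminus i}(\theta) \right)$
        Inclusion: $q(\theta) \leftarrow \tilde{f}_i^{\mathrm{new}}(\theta) \, q_{\setminus i}(\theta)$
    end for
until convergence

Fig. 5. The EP algorithm. Some variations are possible: this assumes that $f_0$ is in the
exponential family, and updates sequentially over $i$ rather than randomly. The names
for the steps (deletion, projection, inclusion) are not the same as in [69]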

Graphical models have also provided an important unifying framework for
thinking about the role of conditional independence in inference in models with many
variables. While for certain models exact inference is computationally tractable,
for most of the models in this chapter we have seen that exact inference involves
intractable sums and integrals. Thus, the study of unsupervised learning has led
us to focus on ways of approximating high dimensional sums and integrals.
We have reviewed many of the principal approximations, although of course in
the limited space of this chapter one cannot hope to have a comprehensive review
of approximation methods.

There are many interesting and relevant topics we did not get a chance to
cover in this review of unsupervised learning. One of these is the interplay of
unsupervised and supervised learning, in the form of semi-supervised learning.
Semi-supervised learning refers to learning problems in which there is a small
amount of labelled data and a large amount of unlabelled data. These problems
are very natural, especially in domains where collecting data can be cheap
(e.g. the internet) but labelling it can be expensive or time consuming. The key
question in semi-supervised learning is how the data distribution from the
unlabelled data should influence the supervised learning problem [71]. Many of the
approaches to this problem attempt to infer a manifold, graph structure, or
tree-structure from the unlabelled data and use spread in this structure to determine
how labels will generalise to new unlabelled points [72, 73, 74, 75].

Another area of great interest which we did not have the space to cover is
nonparametric models. The basic assumption of parametric statistical models
is that the model is defined using a finite number of parameters. The number
of parameters is assumed fixed regardless of the number of data points. Thus
the parameters provide a finite summary of the data. In nonparametric models,
the number of “parameters” in the model is allowed to grow with the size of the
data set. With more data, the model becomes more complex, with no a-priori
limit on the complexity of the model. For this reason nonparametric models
are also sometimes called infinite models. An important example of this are
infinite mixture models, more formally known as Dirichlet process mixture
models [76, 77]. These correspond to mixture models (Section 2.4) where the number of
components is assumed to be infinite. Inference can be done in these models using
MCMC methods [78, 79, 80], variational methods [81], or the EP algorithm [82].
Just as hidden Markov models can be seen as an extension of finite mixture
models to model time series data, it is possible to extend infinite mixture models
to hidden Markov models with infinitely many states [83]. Infinite models based
on Dirichlet processes have also been generalised to be hierarchical in several
different ways [84, 85]. Bayesian inference in nonparametric models is one of the
most active areas of research in unsupervised learning, and there still remain
many open problems.

As we have seen, the field of unsupervised learning can be understood for- 
mally within the framework of information theory and statistics. However, it is 
important not to lose sight of the tremendous influence ideas from neuroscience 
and psychology have had on the field. Many of the models we have reviewed 
here started life as models of brain function. These models were inspired by the 
brain’s ability to extract statistical patterns from sensory data and to recognise 
complex visual scenes, sounds, and odours. Unsupervised learning theory and 
algorithms still have a long way to go to mimic some of the learning abilities 
of biological brains. As the boundaries of unsupervised learning get pushed
forward, we will hopefully not only benefit from better learning machines but also
improve our understanding of how the brain learns.


1 Motivation and Basic Principles of the Monte Carlo Method

The modern history of Monte Carlo techniques dates back to the 1940’s and
the Manhattan project. There are earlier descriptions of Monte Carlo experiments
(Buffon’s famous needle experiment is one of them), and examples have been
traced back to Babylonian and Old Testament times [13]. As we shall see these
techniques are particularly useful in scenarios where it is of interest to perform
calculations that involve - explicitly or implicitly - a probability distribution $\pi$
on a space $X$ (typically $X \subset \mathbb{R}^{n_x}$ for some integer $n_x$), for which closed-form
calculations cannot be carried out due to the algebraic complexity of the problem.
As we shall see the main principle of Monte Carlo techniques consists of
replacing the algebraic representation of $\pi$, e.g. $\frac{1}{\sqrt{2\pi}} \exp(-\frac{1}{2} x^2)$, with a
sample or population representation of $\pi$, e.g. a set of samples
$X_1, X_2, \ldots, X_N \overset{\mathrm{iid}}{\sim} \pi(x) = \frac{1}{\sqrt{2\pi}} \exp(-\frac{1}{2} x^2)$. This proves in practice to be
extremely powerful as difficult - if not impossible - algebraic calculations are
typically replaced with simple calculations in the sample domain. One should
however bear in mind that these are approximations of the true quantity of
interest. An important scenario where Monte Carlo methods can be of great help
is when one is interested in evaluating expectations of functions, say $f$, of the
type $\mathbb{E}_\pi(f(X))$ where $\pi$ is the probability distribution that defines the
expectation. The nature of the approach, where algebraic quantities are
approximated by random quantities, requires one to quantify the random fluctuations
around the true desired value. As we shall see, the power of Monte Carlo techniques
lies in the fact that the rate at which the approximation converges towards the
true value of interest is immune to the dimension $n_x$ of the space $X$ where $\pi$ is
defined. This is the second interest of Monte Carlo techniques.

These numerical techniques have been widely used in physics over the last 50
years, but their interest in the context of Bayesian statistics and more generally
statistics was only fully realized in the late eighties and early nineties. Although we
will here mostly focus on their application in statistics, one should bear in mind
that the material presented in this introduction to the topic has applications far
beyond statistics.

The prerequisites for this introduction are a basic first year undergraduate
background in probability and statistics. Keywords include random variable,
law of large numbers, estimators, central limit theorem and basic notions about
Markov chains.

1.1 Motivating Example 

In this section we motivate and illustrate the use of Monte Carlo methods with a 
toy example. We then point out the power of the approach on a “real” example. 

Calculating $\pi$ with the Help of Rain and the Law of Large Numbers

Consider the 2 × 2 square, say $S \subset \mathbb{R}^2$, with inscribed
disc $\mathcal{D}$ of radius 1 as in Figure 1.

Fig. 1. A 2 × 2 square $S$ with inscribed disk $\mathcal{D}$ of radius 1




Imagine that an “idealized” rain falls uniformly on the square $S$, i.e. the
probability for a drop to fall in a region $A$ is proportional to the area of $A$.
More precisely, let $D$ be the random variable defined on $X = S$ representing the
location of a drop and $A$ a region of the square, then

$$\mathbb{P}(D \in A) = \frac{\int_A dx \, dy}{\int_S dx \, dy} \tag{1}$$

where $x$ and $y$ are the Cartesian coordinates. Now assume that we have observed
$N$ such drops, say $\{D_i, i = 1, \ldots, N\}$ as in Figure 2.

Fig. 2. A 2 × 2 square $S$ with inscribed disk $\mathcal{D}$ of radius 1



Intuitively, without any knowledge of elementary statistics, a sensible tech- 
nique to estimate the probability ¥{D € A) of falling in a given region ^ C 5 
(and think for example of A = V) would consist of using the following formula 



P(D G A) 



number of drops that fell in A 
N 



This formula certainly makes sense, but we would like to be more rigorous 
and give a statistical justification to it. 



V{D G A) Let us first introduce the indicator function of a 

set A, defined as follows. 



"^A{x,y) 



1 if point D = (x, y) G A, 
0 otherwise. 



We define the random variable $V(D) := \mathbb{I}_A(D) := \mathbb{I}_A(X, Y)$, where $X, Y$ are
the random variables that represent the Cartesian coordinates of a uniformly
distributed point on $\mathcal{S}$, denoted $D \sim \mathcal{U}_{\mathcal{S}}$. Using Eq. (1), it is not hard to show that

$$\mathbb{P}(D \in A) = \int_{\mathcal{S}} \mathbb{I}_A(x, y)\, \frac{1}{4}\, dx\,dy = \mathbb{E}_{\mathcal{U}_{\mathcal{S}}}(V),$$

where for a probability distribution $\pi$ we will denote $\mathbb{E}_\pi$ the expectation with
respect to $\pi$.

Now, similarly, let us introduce $\{V_i := V(D_i), i = 1, \ldots, N\}$, the random variables
associated to the drops $\{D_i, i = 1, \ldots, N\}$, and consider the sum

$$S_N = \frac{\sum_{i=1}^N V_i}{N}. \tag{2}$$



We notice that an alternative expression for $S_N$ is

$$S_N = \frac{\text{number of drops that fell in } A}{N},$$

which corresponds precisely to the formula which we intuitively suggested to
approximate $\mathbb{P}(D \in A)$. However Eq. (2) is statistically more explicit, in the
sense that it tells us that our suggested approximation of $\mathbb{P}(D \in A)$ is the
empirical average of independent and identically distributed random variables,
$\{V_i, i = 1, \ldots, N\}$. Assuming that the rain lasts forever and therefore that
$N \to +\infty$, then one can apply the law of large numbers (since $\mathbb{E}_{\mathcal{U}_{\mathcal{S}}}(|V|) < +\infty$ here)
and deduce that

$$\lim_{N \to +\infty} S_N = \mathbb{E}_{\mathcal{U}_{\mathcal{S}}}(V) \quad \text{(almost surely)}.$$

As we have already proved that $\mathbb{P}(D \in A) = \mathbb{E}_{\mathcal{U}_{\mathcal{S}}}(V)$, the law of large
numbers mathematically justifies our intuitive method of estimating $\mathbb{P}(D \in A)$,
provided that $N$ is large enough.



$\pi$. We note that as a special case we have defined
a method of calculating $\pi$. Indeed,

$$\mathbb{P}(D \in \mathcal{D}) = \int_{\mathcal{D}} \frac{1}{4}\, dx\,dy = \frac{\pi}{4}.$$

$S_N$ as defined in Eq. (2) with $A = \mathcal{D}$ is an unbiased estimator of $\pi/4$, which
is also ensured to converge towards $\pi/4$ for $N$ very large. The quantity $S_N - \pi/4$
is presented in Figure 3 as a function of the number of drops for one rainfall.
However in practice one is interested in obtaining a
result in finite time, i.e. for $N$ finite. $S_N$ is a random variable which can be
rewritten as $S_N = \pi/4 + E_N$ where $E_N$ is a random error term. It is naturally
of interest to characterize the precision of our estimator, i.e. characterize the
average magnitude of the fluctuations of the random error $E_N$, as illustrated in
Figure 4. A simple measure of the average magnitude of $E_N$ is its variance,

$$\operatorname{var}(E_N) = \operatorname{var}(S_N) = \frac{1}{N} \operatorname{var}(V_1),$$

as the $\{V_i, i = 1, \ldots, N\}$ are independent. It is worth remembering that since
$S_N$ is unbiased,

$$\sqrt{\operatorname{var}(S_N)} = \sqrt{\mathbb{E}\left[(S_N - \mathbb{P}(D \in \mathcal{D}))^2\right]},$$

which using the result above implies that the root mean square error between $S_N$ and
$\mathbb{P}(D \in \mathcal{D})$ decreases as $1/\sqrt{N}$.

Fig. 3. Convergence of $S_N - \pi/4$ as a function of the number of samples, for one
realization (or rainfall)

Fig. 4. Convergence of $S_N - \pi/4$ for 100 realizations of the rain

This is illustrated in Figure 5 where the dotted
lines represent $\pm\sqrt{\operatorname{var}(V)/N}$ and the dashed lines represent the empirical mean
square error of $S_N - \pi/4$ estimated from the 100 realizations in Figure 4. One can
be slightly more precise and first invoke here an asymptotic result, the central limit
theorem (which can be applied here as $\operatorname{var}(V) < +\infty$). As $N \to +\infty$,

$$\sqrt{N}\,(S_N - \pi/4) \xrightarrow{d} \mathcal{N}(0, \operatorname{var}(V)),$$

which implies that for $N$ large enough the probability of the error being larger
than $2\sqrt{\operatorname{var}(V)/N}$ is

$$\mathbb{P}\left(|S_N - \pi/4| > 2\sqrt{\operatorname{var}(V)/N}\right) \simeq 0.05,$$



with $2\sqrt{\operatorname{var}(V)} = 0.8211$. In the present case (we are sampling here from a
Bernoulli distribution) one can be much more precise and use a non-asymptotic
result. Indeed, using a Bernstein type inequality, one can prove [22, p. 69] that
for any integer $N \geq 1$ and $\varepsilon > 0$,

$$\mathbb{P}(|S_N - \pi/4| > \varepsilon) \leq 2\exp(-2N\varepsilon^2),$$

which tells us that for any $\alpha \in (0, 1]$,

$$\mathbb{P}(|S_N - \pi/4| > \varepsilon) \leq \alpha$$

as soon as $2\exp(-2N\varepsilon^2) \leq \alpha$. On the one hand this provides us with a minimum
number of samples in order to achieve a given precision $\varepsilon$ with probability at least
$1 - \alpha$,

$$N \geq \left[ \frac{\log(2/\alpha)}{2\varepsilon^2} \right] + 1,$$

where for a real $x$ the quantity $[x]$ denotes the integer part of $x$, or alternatively
tells us that for any $N \geq 1$,

$$\mathbb{P}\left(|S_N - \pi/4| > \frac{\sqrt{\log(40)/2}}{\sqrt{N}}\right) \leq 0.05,$$

with $\sqrt{\log(40)/2} = 1.3581$.

Both results tell us that in some sense the approximation error is inversely
proportional to $\sqrt{N}$.
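
The rain experiment is easy to reproduce on a computer. The following is a minimal sketch, assuming that the pseudo-random generator of the Python standard library is an acceptable substitute for the idealized rain; the function names and the seed are illustrative choices and are not part of the text. It computes $S_N$ for the disc and prints the observed error next to the two error bounds discussed above (the asymptotic one from the central limit theorem and the non-asymptotic one from the exponential inequality).

```python
import math
import random


def rain_estimate(n_drops, rng):
    """Return S_N: the fraction of n_drops uniform points of [-1, 1]^2 that fall in the unit disc."""
    hits = 0
    for _ in range(n_drops):
        x = rng.uniform(-1.0, 1.0)
        y = rng.uniform(-1.0, 1.0)
        if x * x + y * y <= 1.0:  # indicator function of the disc
            hits += 1
    return hits / n_drops


def clt_half_width(n_drops):
    """Asymptotic 95% bound 2 * sqrt(var(V) / N), where V is Bernoulli with parameter pi/4."""
    p = math.pi / 4.0
    return 2.0 * math.sqrt(p * (1.0 - p) / n_drops)


def hoeffding_half_width(n_drops, alpha=0.05):
    """Non-asymptotic bound sqrt(log(2 / alpha) / (2 N)) obtained from 2 exp(-2 N eps^2) = alpha."""
    return math.sqrt(math.log(2.0 / alpha) / (2.0 * n_drops))


if __name__ == "__main__":
    rng = random.Random(0)  # arbitrary seed, only for reproducibility
    for n in (100, 10_000, 100_000):
        s_n = rain_estimate(n, rng)
        error = abs(s_n - math.pi / 4.0)
        print(n, 4.0 * s_n, error, clt_half_width(n), hoeffding_half_width(n))
```

Both bounds shrink by a factor of ten each time $N$ is multiplied by one hundred, which is the $1/\sqrt{N}$ behaviour described above.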



Now consider the case where $\mathsf{X} = \mathbb{R}^{n_x}$ for any
integer $n_x$, and in particular large values of $n_x$. Replace now $\mathcal{S}$ and $\mathcal{D}$ above
with a hypercube $\mathcal{S}^{n_x}$ and an inscribed hyperball $\mathcal{D}^{n_x}$ in $\mathsf{X}$. If we could observe
a hyper-rain, then it would not be difficult to see that the method described
earlier to estimate the area of $\mathcal{D}$ could be used to estimate the volume of $\mathcal{D}^{n_x}$.
The only requirement is that one should be able to tell if a drop fell in $\mathcal{D}^{n_x}$ or not:
in other words one should be able to calculate $\mathbb{I}_{\mathcal{D}^{n_x}}(D)$ point-wise. Now a very
important result is that the arguments that led earlier to the formal validation
of the Monte Carlo approach to estimate $\pi/4$ remain identical here (check it to
convince yourself!). In particular the rate of convergence of the estimator in the
mean square sense is again independent of $n_x$.

This would not be the case if we were using a deterministic method on a grid of
regularly spaced points. Typically, the rate of convergence of such deterministic
methods is of the form $1/N^{r/n_x}$ where $r$ is related to the smoothness of the
contours of region $A$, and $N$ is the number of function (here $\mathbb{I}_A$) evaluations.
Monte Carlo methods are thus extremely attractive when $n_x$ is large.

A More General Context. In the previous subsection, we have seen that a
simple experiment involving the rain can help us to evaluate an integral in
an extremely simple way. In this subsection we generalise the ideas developed
earlier in order to tackle the generic problem of estimating

$$\mathbb{E}_\pi(f(X)) = \int_{\mathsf{X}} f(x)\pi(x)\,dx,$$

where $f : \mathsf{X} \to \mathbb{R}^{n_f}$ and $\pi$ is a probability distribution on $\mathsf{X} \subset \mathbb{R}^{n_x}$. We will
assume that $\mathbb{E}_\pi(|f(X)|) < +\infty$ but that it is difficult to obtain an analytical
expression for $\mathbb{E}_\pi(f(X))$.



1.2 Generalization of the Rain Experiment 

In the light of the square/circle example, assume that $N \gg 1$ samples
$X_i \sim \pi$ ($i = 1, \ldots, N$) are available to us (since it is unlikely that rain can
generate samples from any distribution $\pi$, we will address the problem of sample
generation in the next section). Now consider any set $A \subset \mathsf{X}$ and assume that we
are interested in calculating $\pi(A) = \mathbb{P}(X \in A)$ for $X \sim \pi$. We naturally choose
the following estimator

$$\hat{\pi}_N(A) = \frac{\text{number of samples in } A}{\text{total number of samples}},$$

which by the law of large numbers is a consistent estimator of $\pi(A)$ since

$$\lim_{N \to +\infty} \frac{1}{N} \sum_{i=1}^N \mathbb{I}_A(X_i) = \mathbb{E}_\pi(\mathbb{I}_A(X)) = \pi(A).$$

A way of generalizing this in order to evaluate $\mathbb{E}_\pi(f(X))$ consists of considering
the estimator

$$S_N(f) = \frac{1}{N} \sum_{i=1}^N f(X_i),$$

which is unbiased. From the law of large numbers $S_N(f)$ will converge and

$$\lim_{N \to +\infty} \frac{1}{N} \sum_{i=1}^N f(X_i) = \mathbb{E}_\pi(f(X)) \quad \text{(almost surely)}.$$

Here again a good measure of the approximation is the variance of $S_N(f)$,

$$\operatorname{var}_\pi[S_N(f)] = \operatorname{var}_\pi\left[\frac{1}{N} \sum_{i=1}^N f(X_i)\right] = \frac{\operatorname{var}_\pi[f(X)]}{N}.$$

Now the central limit theorem applies if $\operatorname{var}_\pi[f(X)] < \infty$ and tells us that

$$\sqrt{N}\,\big(S_N(f) - \mathbb{E}_\pi(f(X))\big) \xrightarrow{d} \mathcal{N}(0, \operatorname{var}_\pi[f(X)]),$$

and the conclusions drawn in the rain example are still valid here:

1. The rate of convergence is immune to the dimension of $\mathsf{X}$.

2. It is easy to take complex integration domains into account.

3. It is easily implementable and general. The requirements are

(a) to be able to evaluate $f(x)$ for any $x \in \mathsf{X}$,

(b) to be able to produce samples distributed according to $\pi$.
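
The two requirements (a) and (b) translate directly into code: one function that evaluates $f$ and one that returns a sample from $\pi$. The sketch below is only an illustration under assumptions of our own: the helper name `monte_carlo` is hypothetical, and the example takes $\pi = \mathcal{N}(0, 1)$ and $f(x) = x^2$ (so that $\mathbb{E}_\pi(f(X)) = 1$), a choice made here and not taken from the text. Besides $S_N(f)$ it returns an estimate of $\sqrt{\operatorname{var}_\pi[f(X)]/N}$, the standard deviation suggested by the variance formula above.

```python
import math
import random


def monte_carlo(f, sampler, n_samples):
    """Return (S_N(f), estimated standard error) from n_samples independent draws of sampler()."""
    values = [f(sampler()) for _ in range(n_samples)]
    mean = sum(values) / n_samples
    # empirical variance of f(X), used to estimate var[f(X)] / N
    var = sum((v - mean) ** 2 for v in values) / (n_samples - 1)
    return mean, math.sqrt(var / n_samples)


if __name__ == "__main__":
    rng = random.Random(0)  # arbitrary seed
    # Illustrative choice (an assumption of this example): pi = N(0, 1), f(x) = x^2.
    estimate, std_err = monte_carlo(lambda x: x * x, lambda: rng.gauss(0.0, 1.0), 100_000)
    print(estimate, std_err)
```

Nothing in `monte_carlo` depends on the dimension of $\mathsf{X}$: only the sampler and $f$ know about it.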



1.3 From the Algebraic to the Sample Representation

In this subsection we make explicit the - approximate - sample representation
of $\pi$. Let us first introduce the delta-Dirac function $\delta_{x_0}$ for $x_0 \in \mathsf{X}$, defined as
follows

$$\int_{\mathsf{X}} f(x)\,\delta_{x_0}(x)\,dx = f(x_0),$$

for any $f : \mathsf{X} \to \mathbb{R}^{n_f}$. Note that this implies in particular that for $A \subset \mathsf{X}$,

$$\int_{\mathsf{X}} \mathbb{I}_A(x)\,\delta_{x_0}(x)\,dx = \int_A \delta_{x_0}(x)\,dx = \mathbb{I}_A(x_0).$$

Now, for $X_i \sim \pi$ for $i = 1, \ldots, N$, we can introduce the following mixture of
delta-Dirac functions

$$\hat{\pi}_N(x) := \frac{1}{N} \sum_{i=1}^N \delta_{X_i}(x),$$

which is the empirical measure of the sample, and consider for any $A \subset \mathsf{X}$

$$\hat{\pi}_N(A) = \int_A \hat{\pi}_N(x)\,dx = \sum_{i=1}^N \int_A \frac{1}{N}\,\delta_{X_i}(x)\,dx = \frac{1}{N} \sum_{i=1}^N \mathbb{I}_A(X_i),$$

which is precisely $S_N(\mathbb{I}_A)$. What we have touched upon here is simply the sample
representation of $\pi$, of which an illustration can be found in Figure 6 for a
Gaussian distribution. The concentration of points in a given region of
the space represents $\pi$. Note that this approach is in contrast with what is
usually done in parametric statistics, i.e. start with samples and then introduce a
distribution with an algebraic representation for the underlying population. Note
that here each sample $X_i$ has a weight of $1/N$, but that it is also possible to
consider weighted sample representations of $\pi$: the approach is called importance
sampling and will be covered later on.

Now consider the problem of estimating $\mathbb{E}_\pi(f)$. We simply replace $\pi$ with its
sample representation $\hat{\pi}_N$ and obtain

$$\int_{\mathsf{X}} f(x)\,\hat{\pi}_N(x)\,dx = \sum_{i=1}^N \frac{1}{N} \int_{\mathsf{X}} f(x)\,\delta_{X_i}(x)\,dx = \frac{1}{N} \sum_{i=1}^N f(X_i),$$

which is precisely $S_N(f)$, the Monte Carlo estimator suggested earlier. The
interest of this approximating representation of $\pi$ will become clearer later, in
particular in the context of importance sampling.

Fig. 5. Variance of $S_N - \pi/4$ across 100 realizations as a function of the number of
samples and the theoretical variance

Fig. 6. Sample representation of a Gaussian distribution



1.4 Expectations in Statistics 

The aim of this subsection is to illustrate why it is important to compute expectations
in statistics, in particular in the Bayesian context.

Assume that we are given a Bayesian model, i.e. a likelihood $p(y|\theta)$ and a
prior distribution $p(\theta)$. We observe some data $y$ and wish to estimate $\theta$. In a
Bayesian framework, all the available information about $\theta$ is summarized by the
posterior distribution, given by Bayes’ rule,

$$p(\theta|y) = \frac{p(y|\theta)\,p(\theta)}{\int_\Theta p(y|\theta)\,p(\theta)\,d\theta}.$$

The expression looks simple, but the bottom of the fraction is an integral,
and more precisely an expectation

$$\mathbb{E}_{p(\theta)}(p(y|\theta)) = \int_\Theta p(y|\theta)\,p(\theta)\,d\theta.$$

In many situations this integral does not admit a closed-form expression.

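We observe $y = (y_1, y_2, \ldots, y_T)$ which are such that $y_i \sim \mathcal{N}(\mu_j, \sigma_j^2)$
with probability $p_j$ for $j = 1, 2$. Here $\theta = (\mu_1, \mu_2, \sigma_1^2, \sigma_2^2, p_1)$. The likelihood in
this case is

$$p(y|\theta) = \prod_{i=1}^T \left[ \frac{p_1}{\sqrt{2\pi\sigma_1^2}} \exp\left(-\frac{(y_i - \mu_1)^2}{2\sigma_1^2}\right) + \frac{1 - p_1}{\sqrt{2\pi\sigma_2^2}} \exp\left(-\frac{(y_i - \mu_2)^2}{2\sigma_2^2}\right) \right].$$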
The normalizing constant of the posterior can be complicated, e.g. impose 
constraints on the parameters < 10ct| -I- y/p\P 2 and p 2 < br- 
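Other important examples include the evaluation of the posterior mean square
estimate of $\theta$,

$$\theta_{mse} := \mathbb{E}_{p(\theta|y)}(\theta) = \int_\Theta \theta\, p(\theta|y)\,d\theta,$$

the median, i.e. the solution $\theta_{median}$ of

$$\int_{-\infty}^{+\infty} \mathbb{I}(\theta \leq \theta_{median})\, p(\theta|y)\,d\theta = 1/2,$$

but also the evaluation of the marginal posterior distribution $p(\theta_1|y)$ of $p(\theta_1, \theta_2|y)$,

$$p(\theta_1|y) = \int_\Theta p(\theta_1, \theta_2|y)\,d\theta_2 = \int_\Theta p(\theta_1|\theta_2, y)\,p(\theta_2|y)\,d\theta_2 = \mathbb{E}_{p(\theta_2|y)}(p(\theta_1|\theta_2, y)).$$

Similar problems are encountered when computing marginal posterior means,
posterior variances and posterior credibility regions.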



1.5 A Simple Application 

In 1786 Laplace was interested in determining if the probability $\theta$ of a male
birth in Paris over a certain period of time was above 0.5 or not. The official
figures gave $y_1 = 251{,}527$ male births for $y_2 = 241{,}945$ female births. The observed
proportion was therefore 0.509. We choose a uniform distribution as prior
distribution for $\theta$, the proportion of male births. The posterior distribution is

$$p(\theta|y) = \mathcal{B}e(\theta; 251528, 241946).$$

Imagine that we have no table and are interested in the posterior mean of
this posterior distribution. Furthermore, imagine that we can sample (using a
computer) a large number $N$ of independent samples $\{\theta_i, i = 1, \ldots, N\}$ from this
distribution. One could propose the following estimator

$$\frac{1}{N} \sum_{i=1}^N \theta_i,$$

as from the law of large numbers,

$$\lim_{N \to +\infty} \frac{1}{N} \sum_{i=1}^N \theta_i = \mathbb{E}_{p(\theta|y)}(\theta).$$

We could also estimate the posterior variance as

$$\frac{1}{N} \sum_{i=1}^N \left(\theta_i - \frac{1}{N} \sum_{j=1}^N \theta_j\right)^2.$$

Now consider the following more challenging problems: we want to find estimates
of the median of this posterior distribution, as well as a 95% credibility
interval. We start with the median, and assume that we have ordered the samples,
that is for any $i < j$, $\theta_i < \theta_j$, and for simplicity that $N$ is an even number.
Let $\bar{\theta}$ be the median of the posterior distribution. Then we know that

$$\mathbb{P}(\theta \geq \bar{\theta}) = \int_{-\infty}^{+\infty} \mathbb{I}(\theta \geq \bar{\theta})\,p(\theta|y)\,d\theta = 1/2,$$

$$\mathbb{P}(\theta \leq \bar{\theta}) = \int_{-\infty}^{+\infty} \mathbb{I}(\theta \leq \bar{\theta})\,p(\theta|y)\,d\theta = 1/2,$$

so that it is sensible to choose an estimate for $\bar{\theta}$ between $\theta_{N/2}$ and $\theta_{N/2+1}$.
Now assume that we are looking for $\theta^-$ and $\theta^+$ such that

$$\mathbb{P}(\theta^- \leq \theta \leq \theta^+) = \int_{-\infty}^{+\infty} \mathbb{I}(\theta^- \leq \theta \leq \theta^+)\,p(\theta|y)\,d\theta = 0.95,$$

or

$$\mathbb{P}(0 \leq \theta \leq \theta^-) = 0.025 \quad \text{and} \quad \mathbb{P}(\theta^+ \leq \theta \leq 1) = 0.025,$$

and assume again for simplicity that $N = 1000$ and that the samples have
been ordered. We find that a reasonable estimate of $\theta^-$ is between $\theta_{25}$ and $\theta_{26}$
and an estimate of $\theta^+$ between $\theta_{975}$ and $\theta_{976}$. Finally we might be interested in
calculating

$$\mathbb{P}(\theta < 0.5) = \int_0^{0.5} p(\theta|y)\,d\theta = \int_0^1 \mathbb{I}(\theta \leq 0.5)\,p(\theta|y)\,d\theta,$$

which suggests the following estimator of this probability

$$\mathbb{P}(\theta < 0.5) \simeq \frac{1}{N} \sum_{i=1}^N \mathbb{I}(\theta_i \leq 0.5)$$

(one can in fact find that $\mathbb{P}(\theta < 0.5 | y_1, y_2) = 1.146058490255674 \times 10^{-42}$).
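
All the estimators of this subsection can be computed from a single sorted list of samples. The sketch below is one possible way of doing it, assuming that `random.betavariate` from the Python standard library is an adequate source of independent samples from the posterior; the function name `posterior_summaries` and the choice of taking the midpoint between two neighbouring order statistics are our own conventions, not prescribed by the text.

```python
import random


def posterior_summaries(a, b, n_samples, rng):
    """Monte Carlo summaries of a Be(a, b) posterior; n_samples is assumed to be a multiple of 40."""
    theta = sorted(rng.betavariate(a, b) for _ in range(n_samples))
    mean = sum(theta) / n_samples
    var = sum((t - mean) ** 2 for t in theta) / n_samples
    # median: midpoint between the order statistics N/2 and N/2 + 1 (1-based indices)
    median = 0.5 * (theta[n_samples // 2 - 1] + theta[n_samples // 2])
    # 95% interval: 2.5% of the ordered samples are left out on each side
    k = n_samples // 40
    lower = 0.5 * (theta[k - 1] + theta[k])
    upper = 0.5 * (theta[n_samples - k - 1] + theta[n_samples - k])
    # fraction of samples below 0.5
    prob_below_half = sum(t <= 0.5 for t in theta) / n_samples
    return mean, var, median, (lower, upper), prob_below_half


if __name__ == "__main__":
    rng = random.Random(0)  # arbitrary seed
    print(posterior_summaries(251528, 241946, 1000, rng))
```

With $N = 1000$ the indices used for the interval are exactly $\theta_{25}, \theta_{26}$ and $\theta_{975}, \theta_{976}$. The last output should be expected to be exactly zero for any realistic $N$: an event of probability of order $10^{-42}$ is essentially never hit by plain sampling, which is one motivation for the variance reduction ideas of the next subsection.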



1.6 Further Topic: Importance Sampling 

In this subsection we explore the important method of importance sampling.¹
This method is of interest either in the case where samples from the desired
distribution $\pi$ are not available, but samples from a distribution $q$ are, or as a
way of possibly reducing the variance of an estimator.

Importance Sampling. Consider a probability distribution $q$ such that
$\pi(x) > 0 \Rightarrow q(x) > 0$. Then one can write

$$\mathbb{E}_\pi(f(X)) = \int_{\mathsf{X}} f(x)\pi(x)\,dx = \int_{\mathsf{X}} f(x)\frac{\pi(x)}{q(x)}\,q(x)\,dx = \mathbb{E}_q(w(X)f(X)), \qquad w(x) := \frac{\pi(x)}{q(x)}.$$

We are now integrating the function $w(x)f(x)$ with respect to the distribution
$q$. Now provided that we can produce $N$ i.i.d. samples $X_1, \ldots, X_N$ from $q$, then
one can suggest the following estimator

$$I_N(f) = \frac{1}{N} \sum_{i=1}^N \frac{\pi(X_i)}{q(X_i)}\, f(X_i) = \int_{\mathsf{X}} f(x)\, \frac{1}{N} \sum_{i=1}^N \frac{\pi(X_i)}{q(X_i)}\, \delta_{X_i}(x)\,dx.$$



It is customary to call $W_i = \pi(X_i)/q(X_i)$ the importance weights, and $q$ the importance
distribution. Now it is natural to introduce a delta-Dirac approximation of $\pi$
of the form

$$\hat{\pi}_N(x) = \frac{1}{N} \sum_{i=1}^N W_i\, \delta_{X_i}(x).$$

The interpretation of this weighted empirical measure is rather simple. Large
$W_i$’s indicate an underrepresentation of $\pi$ by samples from $q$ around $X_i$. Small
$W_i$’s indicate an overrepresentation of $\pi$ by samples from $q$ around $X_i$. This
phenomenon is illustrated in Figure 7 where the importance weights required
to represent a double exponential with samples from either a Gaussian or a
t-Student are presented. Note that in the case where $q = \pi$ then $W_i = 1$, each sample
carries the weight $1/N$, and we recover the representation presented earlier.

It is also worth noticing that if the normalizing constants of $\pi$ and/or $q$ are
not known, then it is possible to define (with $\pi^*(x) \propto \pi(x)$ and $q^*(x) \propto q(x)$)
the normalized weights

$$\widetilde{W}_i = \frac{\pi^*(X_i)/q^*(X_i)}{\sum_{j=1}^N \pi^*(X_j)/q^*(X_j)}.$$



¹ This material can be skipped at first.

Fig. 7. Top: The three distributions. Middle: importance weights to represent a double
exponential with samples from a Gaussian. Bottom: importance weights to represent
a double exponential with samples from a t-Student



One can then consider the following estimator, which only involves $\pi^*$ and $q^*$,

$$\widetilde{I}_N(f) = \sum_{i=1}^N \frac{\pi^*(X_i)/q^*(X_i)}{\sum_{j=1}^N \pi^*(X_j)/q^*(X_j)}\, f(X_i).$$

This estimator is biased, but

$$\lim_{N \to \infty} \widetilde{I}_N(f) = \lim_{N \to \infty} \frac{\frac{1}{N}\sum_{i=1}^N \frac{\pi^*(X_i)}{q^*(X_i)} f(X_i)}{\frac{1}{N}\sum_{j=1}^N \frac{\pi^*(X_j)}{q^*(X_j)}} = \frac{\int_{\mathsf{X}} f(x)\,w(x)\,q(x)\,dx}{\int_{\mathsf{X}} w(x)\,q(x)\,dx} = \mathbb{E}_\pi(f(X)),$$

as the unknown normalizing constants cancel.

In a Bayesian framework the target distribution is $\pi(\theta) = p(\theta|y)$, the posterior
distribution. One can suggest (and this is not necessarily a
good choice) $q(\theta) = p(\theta)$. In this case the weights will be proportional to the
likelihood since

$$w(\theta) = \frac{p(\theta|y)}{p(\theta)} \propto \frac{p(y|\theta)\,p(\theta)}{p(\theta)} = p(y|\theta).$$

Unfortunately this technique is not as general as it might seem. Let us consider
the variance of the importance sampling estimator in the simple case where
the normalizing constants are known and where $f = C$, i.e. is a constant. In this
case

$$\operatorname{var}_q(I_N(f)) = \frac{C^2}{N}\left(\mathbb{E}_q(W_1^2) - \mathbb{E}_q^2(W_1)\right),$$

which suggests that even in the simplest case the variance of the weights should
be finite and as small as possible for the variance of $I_N(f)$ to be small. The
examples provided earlier in Figure 7, where $\pi$ was a double exponential and $q$
either a normal or t-Student distribution, illustrate the possibly large variations
of the weights.
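
To make the role of the weights concrete, here is a small sketch of the normalized (ratio) form of the importance sampling estimator, for a double exponential target known only up to a constant, $\pi^*(x) = \exp(-|x|)$, and a t-Student importance distribution. Several choices are assumptions made for the illustration and are not taken from the text: the number of degrees of freedom (`nu = 3`), the test function $f(x) = x^2$ (whose expectation under the double exponential is 2), the function names, and the "effective sample size" $(\sum_i w_i)^2 / \sum_i w_i^2$, a commonly used diagnostic of how uneven the weights are.

```python
import math
import random


def laplace_unnorm(x):
    """Unnormalized double exponential target pi*(x) = exp(-|x|)."""
    return math.exp(-abs(x))


def student_t_unnorm(x, nu):
    """Unnormalized t-Student density q*(x) with nu degrees of freedom."""
    return (1.0 + x * x / nu) ** (-(nu + 1.0) / 2.0)


def sample_student_t(nu, rng):
    """Draw a t-Student variate as Z / sqrt(G / nu), with Z ~ N(0, 1) and G ~ chi-square(nu)."""
    z = rng.gauss(0.0, 1.0)
    g = rng.gammavariate(nu / 2.0, 2.0)  # shape nu/2, scale 2
    return z / math.sqrt(g / nu)


def self_normalized_is(f, n_samples, nu, rng):
    """Return (estimate of E_pi(f), effective sample size) using samples from the t-Student."""
    xs = [sample_student_t(nu, rng) for _ in range(n_samples)]
    ws = [laplace_unnorm(x) / student_t_unnorm(x, nu) for x in xs]
    total = sum(ws)
    estimate = sum(w * f(x) for w, x in zip(ws, xs)) / total
    ess = total * total / sum(w * w for w in ws)
    return estimate, ess


if __name__ == "__main__":
    rng = random.Random(0)  # arbitrary seed
    print(self_normalized_is(lambda x: x * x, 50_000, 3.0, rng))
```

The t-Student has heavier tails than the double exponential, so the weights are bounded. Replacing it with a standard Gaussian makes $\pi^*/q^*$ grow like $\exp(x^2/2 - |x|)$ in the tails, and the variance of the weights is then infinite, which is the situation warned against above.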



Zero Variance Estimator. Here we illustrate a possible interest of importance
sampling, which is however specialized. We start with the trivial remark that the
variance of a constant function is null, i.e. $\operatorname{var}_\pi[f] = 0$ if $f$ is a constant. We seek
here to exploit this property in the context of Monte Carlo integration, although
this might seem of little interest at first sight since no numerical method is
needed to evaluate $\mathbb{E}_\pi(f)$ for a constant function $f$. However we are going to use
this as a motivation to describe a method of reducing the variance of a Monte
Carlo estimator for a fixed number of samples. Now assume that $\mathbb{E}_\pi(f) \neq 0$ and that
for simplicity $f \geq 0$. Using the convention $0/0 = 1$ we can rewrite $\mathbb{E}_\pi(f)$ as

$$\int_{\mathsf{X}} f(x)\pi(x)\,dx = \int_{\mathsf{X}} \frac{\mathbb{E}_\pi(f)}{\mathbb{E}_\pi(f)}\, \pi(x)f(x)\,dx,$$

that is

$$\mathbb{E}_\pi(f) = \mathbb{E}_q(\mathbb{E}_\pi(f)),$$

where

$$q(x) = \frac{\pi(x)f(x)}{\int_{\mathsf{X}} \pi(x')f(x')\,dx'}$$

can be thought of as being a probability density. If we could sample from $q$ then
we could integrate the constant function $\int_{\mathsf{X}} \pi(x')f(x')\,dx'$ and obtain a zero
variance estimator. Naturally we have not solved the problem since the constant
is precisely the integral that we are seeking to calculate!

The calculations above can be generalized to functions $f$ that are not everywhere
positive, with in this case,

$$q(x) = \frac{|f(x)|\,\pi(x)}{\int_{\mathsf{X}} |f(x')|\,\pi(x')\,dx'}.$$

Despite our disappointing/absurd results, the strategy however suggests ways
to improve the constant $\operatorname{var}_\pi(f)$, by trying to sample from a distribution close to
$q$. Note however that $q$ depends on $f$, and that as a consequence such a method
is very specialized.



Conclusions. To summarise, the pros and cons of importance sampling are as
follows:

— Advantages. Easy to implement, parallelizable, sequential versions are possible
(particle filter etc.). If $q$ is a clever approximation of $\pi$, then we typically
expect good results. It can be used as a specialized way of reducing the variance
of estimators.

— Drawbacks. If we do not have $\operatorname{var}_q(w(X)) < +\infty$, then typically $I_N(f)$
can be a poor estimator since its variance is large. This poses the problem
of the choice of $q(x)$: where are the modes of $\pi(x)$? Importance sampling
is typically limited to small dimensions for the parameter space, say $n_x$ between
10 and 50 depending on the application.

Despite the possible drawbacks, importance sampling has proved to be extremely
useful in the context of sequential importance sampling.



2 Classical “Exact” Simulation Methods 

In this section we review some classical simulation techniques. We call those
techniques “exact” as they allow one to generate samples in a finite number of steps
of a procedure. Note that the instant when a sample from the distribution
of interest is produced is identifiable, that is we can stop the procedure
and be sure that we have generated a sample from the distribution of interest.
As we shall see in the next section this is not always the case. Unfortunately
the simulation techniques presented in this chapter cannot typically be used in
order to sample from complex distributions as they tend not to scale well with
the dimension $n_x$, nor to cope with cases where little is known about $\pi$. However these
techniques can be thought of as being building blocks of more complex algorithms
that will be presented in the next chapter.

From now on we will assume that a computer can generate independent uniformly
distributed random variables on $[0, 1]$, or at least that it can generate a good
approximation of such random variables (indeed computers should usually follow
a deterministic behavior, and one must find ways around this in order to produce
something that looks random).

2.1 The cdf Inversion Method 

We present here this method in the case where $\mathsf{X} = \mathbb{R}$ for simplicity. The multivariate
generalization is not difficult. First we consider a simple discrete example
where $X \in \mathsf{X} = \{1, 2, 3\}$ and such that

P(X = 1) = P(^ = 2) = 

Define the cumulative probability distribution (cdf) of $X$ as

$$F_X(x) = \mathbb{P}(X \leq x) = \sum_{i=1}^3 \mathbb{P}(X = i)\,\mathbb{I}(i \leq x)$$

for $x \in [0, 3]$ and its inverse

$$F_X^{-1}(u) = \inf\{x \in \mathsf{X};\ F_X(x) \geq u\},$$

for $u \in [0, 1]$. The cdf corresponding to our example is represented in Figure 8. A
method of sampling from this distribution consists of sampling $u \sim \mathcal{U}(0, 1)$ and
finding $x = F_X^{-1}(u)$. The probability of $u$ falling in the vertical interval $i$ is precisely
equal to the probability $\mathbb{P}(X = i)$. The method indeed produces samples from
the distribution of interest.
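
In the discrete case the inverse $F_X^{-1}$ is found by scanning the cumulative sums until they reach $u$. The following sketch shows this; the function name `sample_discrete` and the probabilities $(0.2, 0.3, 0.5)$ used in the usage example are hypothetical values chosen for illustration only.

```python
import random
from itertools import accumulate


def sample_discrete(values, probs, rng):
    """Inverse cdf sampling: return the smallest value x such that F_X(x) >= u, with u ~ U(0, 1)."""
    u = rng.random()
    for x, cum in zip(values, accumulate(probs)):
        if cum >= u:
            return x
    # only reached if rounding leaves the last cumulative sum slightly below 1
    return values[-1]


if __name__ == "__main__":
    rng = random.Random(0)  # arbitrary seed
    # Hypothetical probabilities for X in {1, 2, 3}, used only as an example.
    draws = [sample_discrete([1, 2, 3], [0.2, 0.3, 0.5], rng) for _ in range(10_000)]
    print([draws.count(v) / len(draws) for v in (1, 2, 3)])
```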

Now in the continuous case, and assuming that the distribution has a density $\pi$,
the cdf takes the form

$$F_X(x) = \int_{-\infty}^{+\infty} \pi(u)\,\mathbb{I}(u \leq x)\,du = \int_{-\infty}^{x} \pi(u)\,du.$$

A normal distribution and its cdf are presented in Figure 9. Intuitively the
algorithm suggested in the discrete case should be valid here, since modes of
$\pi$ mean large variations of $F_X$ and therefore a large probability for a uniform
random variable to fall in these regions.

More rigorously, consider the algorithm

Sample $u \sim \mathcal{U}(0, 1)$ and set $Y = F_X^{-1}(u)$.

We prove that this algorithm produces samples from $\pi$. We calculate the cdf
of $Y$ produced by the algorithm above. For any $y \in \mathsf{X}$ we have

$$\begin{aligned}
\mathbb{P}(Y \leq y) &= \mathbb{P}(F_X^{-1}(u) \leq y) \\
&= \mathbb{P}(u \leq F_X(y)) \quad \text{since } F_X \text{ is non-decreasing} \\
&= \int_0^1 \mathbb{I}(u \leq F_X(y)) \times 1\,du = \int_0^{F_X(y)} du = F_X(y),
\end{aligned}$$

which shows that the cdf of $Y$ produced by the algorithm above is precisely the
cdf of $X \sim \pi$.

Fig. 8. The distribution and cdf of a discrete random variable

Fig. 9. The distribution and cdf of a normal distribution

Consider the exponential distribution with parameter 1, i.e. $X \sim
\pi(x) = \exp(-x)\,\mathbb{I}_{[0,+\infty)}(x)$. The cdf of $X$ is $F_X(x) = 1 - \exp(-x)$. Now the
inverse cdf is $F_X^{-1}(u) = -\log(1 - u)$, and for $u \sim \mathcal{U}(0, 1)$ then $-\log(1 - u) \sim \pi$.

This example is interesting as it illustrates one of the fundamental ideas of
most simulation methods: sample from a distribution from which it is easy to
sample (here the uniform distribution) and then transform this random variable
(here through $F_X^{-1}$). However this method is only applicable to a limited number
of cases as it requires a closed form expression of the inverse of the cdf, which
is not explicit even for a distribution as simple and common as the normal
distribution.
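
The exponential example fits in two lines of code. This is a minimal sketch; the function name is an illustrative choice, and the check against the mean and the median of the exponential distribution (1 and $\log 2$) is added here only as a sanity test.

```python
import math
import random


def sample_exponential(rng):
    """Inverse cdf sampling from the exponential distribution with parameter 1."""
    u = rng.random()  # u lies in [0, 1), so 1 - u lies in (0, 1] and the logarithm is finite
    return -math.log(1.0 - u)


if __name__ == "__main__":
    rng = random.Random(0)  # arbitrary seed
    xs = sorted(sample_exponential(rng) for _ in range(100_000))
    # empirical mean and median, to be compared with 1 and log(2)
    print(sum(xs) / len(xs), xs[len(xs) // 2], math.log(2.0))
```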



2.2 The Rejection Method 

The rejection method allows one to sample according to a distribution $\pi$ which
is only known up to a proportionality constant, say $\pi^* \propto \pi$. It relies again on the
assumption that samples can be generated from a so-called proposal distribution
$q$ defined on $\mathsf{X}$, which might as well be known only up to a normalizing constant,
say $q^* \propto q$. Then, instead of being transformed by a deterministic function as
in the inverse cdf method, the samples produced from $q$ are either rejected or
accepted. More precisely, assume that $C = \sup_{x \in \mathsf{X}} \pi^*(x)/q^*(x) < +\infty$
(note that this imposes that for any $x \in \mathsf{X}$, $\pi^*(x) > 0 \Rightarrow q^*(x) > 0$) and consider
$C' \geq C$. Then the accept/reject procedure proceeds as follows:

Accept/Reject procedure
1. Sample $Y \sim q$ and $u \sim \mathcal{U}(0, 1)$.

2. If $u \leq \frac{\pi^*(Y)}{C' q^*(Y)}$ then return $Y$; otherwise return to step 1.

The intuition behind the method can be understood from Figure 10. 




Fig. 10. The idea behind the rejection method 



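Now we prove that $\mathbb{P}(Y \leq x \mid Y \text{ accepted}) = \mathbb{P}(X \leq x)$. We will extensively
use the trivial identity

$$q(y) = \frac{q^*(y)}{\int_{\mathsf{X}} q^*(x)\,dx}.$$

For any $x \in \mathsf{X}$, consider the joint distribution

$$\begin{aligned}
\mathbb{P}(Y \leq x \text{ and } Y \text{ accepted}) &= \int_0^1 \int_{-\infty}^{x} \mathbb{I}\left(u \leq \frac{\pi^*(y)}{C' q^*(y)}\right) q(y) \times 1\,dy\,du \\
&= \int_{-\infty}^{x} \frac{\pi^*(y)}{C' q^*(y)}\, q(y)\,dy \\
&= \frac{\int_{-\infty}^{x} \pi^*(y)\,dy}{C' \int_{\mathsf{X}} q^*(y)\,dy},
\end{aligned}$$

and the probability of being accepted is the marginal of $\mathbb{P}(Y \leq x \text{ and } Y \text{ accepted})$,
that is

$$\mathbb{P}(Y \text{ accepted}) = \frac{\int_{\mathsf{X}} \pi^*(y)\,dy}{C' \int_{\mathsf{X}} q^*(y)\,dy}. \tag{3}$$

Consequently

$$\mathbb{P}(Y \leq x \mid Y \text{ accepted}) = \frac{\int_{-\infty}^{x} \pi^*(y)\,dy}{\int_{\mathsf{X}} \pi^*(y)\,dy} = \mathbb{P}(X \leq x).$$

The expression for the probability of being accepted in Eq. (3) tells us that in
order to design an efficient algorithm, $C'$ should be chosen as small as possible,
and that the optimal choice corresponds to $C' = C$. However this constant might be
very large, in particular for large $n_x$, and $C$ might not even be known. In the
most favorable scenarios, at best an upper bound might be known.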

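We want to sample from a $\mathcal{B}e(x; \alpha, \beta) \propto x^{\alpha-1}(1 - x)^{\beta-1}$ distribution.
We can generate samples from $\mathcal{U}(0, 1)$. One can find $\sup_{x \in [0,1]} x^{\alpha-1}(1 - x)^{\beta-1}$
analytically for $\alpha, \beta > 1$. Note that we do not assume that the normalizing
constant is known!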
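
For this Beta example the accept/reject procedure can be sketched as follows, with $q^* = 1$ on $[0, 1]$. The supremum is attained at the mode $(\alpha - 1)/(\alpha + \beta - 2)$, a standard fact that is used here but not stated in the text; the function names and the parameter values of the usage example are illustrative assumptions. The function also returns the number of proposals needed, so that the acceptance probability of Eq. (3) can be observed empirically.

```python
import random


def beta_unnorm(x, a, b):
    """Unnormalized Beta density pi*(x) = x^(a-1) (1-x)^(b-1) on [0, 1]."""
    return x ** (a - 1.0) * (1.0 - x) ** (b - 1.0)


def sample_beta_rejection(a, b, rng):
    """Accept/reject with a U(0, 1) proposal, for a, b > 1. Returns (sample, number of proposals)."""
    mode = (a - 1.0) / (a + b - 2.0)
    c = beta_unnorm(mode, a, b)  # C = sup of pi*/q* with q* = 1
    n_proposals = 0
    while True:
        n_proposals += 1
        y = rng.random()  # step 1: Y ~ q
        u = rng.random()  #         u ~ U(0, 1)
        if u <= beta_unnorm(y, a, b) / c:  # step 2: accept or start again
            return y, n_proposals


if __name__ == "__main__":
    rng = random.Random(0)  # arbitrary seed
    results = [sample_beta_rejection(2.0, 3.0, rng) for _ in range(20_000)]
    mean = sum(y for y, _ in results) / len(results)
    avg_proposals = sum(n for _, n in results) / len(results)
    print(mean, avg_proposals)  # the mean of Be(2, 3) is 2/5
```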




Let us assume that one wants to simulate samples from $\pi(\theta) =
p(\theta|y) \propto p(y|\theta)\,p(\theta)$. We assume that $p(y|\theta)$ is known analytically and $p(y|\theta) \leq C$
for any $\theta$, where $C$ is known. We also assume that we are able to simulate from
$p(\theta)$. Thus one can choose $q(\theta) = p(\theta)$ and use the accept/reject procedure to
sample from $p(\theta|y)$. Indeed

$$\frac{p(\theta|y)}{p(\theta)} = \frac{p(y|\theta)}{p(y)} \leq \frac{C}{p(y)} =: M \tag{4}$$

is bounded and

$$\frac{p(\theta|y)}{M q(\theta)} = \frac{p(\theta|y)}{M p(\theta)} = \frac{p(y|\theta)}{C} \tag{5}$$

can be evaluated analytically. However, the acceptance rate $1/M$ is usually
unknown as it involves $p(y)$ which is itself usually unknown.

We can summarise the pros and cons of the accept/reject procedure:

— Advantages:

1. seems rather universal, and compared to the inverse cdf method requires
fewer algebraic properties.

2. in principle neither the normalization constant of $\pi$ nor that of $q$ is
needed.

— Drawbacks:

1. how to construct the proposal $q(x)$ to minimize $C$?

2. typically $C$ increases exponentially with $n_x$.



2.3 Deterministic Transformations 



These methods rely on clever changes of variables, which transform one distribution
into another. A typical setup is the following: consider $Y \sim q$ from which
it is easy to sample, and consider $g : \mathsf{X} \to \mathsf{X}$ a differentiable and one-to-one
transformation. Now define the transformed random variable

$$X = g(Y).$$

We know that the density, say $\pi$, of $X$ can be expressed in terms of $q$ and
the Jacobian of the inverse transformation $g^{-1}$ as follows

$$\pi(x) = q(g^{-1}(x)) \left| \det \frac{\partial g^{-1}}{\partial x}(x) \right|.$$

Naturally for a predefined $\pi$ it is not always obvious how to find proper $g$ and
$q$, but we present here a celebrated example. The Box-Muller transformation is
a method of transforming two independent uniformly distributed random variables $Y_1$
and $Y_2$ on $[0, 1]$ into two independent normally distributed random variables $X_1$ and $X_2$
with distribution $\mathcal{N}(0, 1)$. The transformation is as follows

$$\begin{aligned}
X_1 &= \sqrt{-2\log(Y_1)}\,\cos(2\pi Y_2) \\
X_2 &= \sqrt{-2\log(Y_1)}\,\sin(2\pi Y_2).
\end{aligned} \tag{6}$$

We compute the inverse transformation and find that

$$Y_1 = \exp\left(-(X_1^2 + X_2^2)/2\right), \qquad Y_2 = \frac{1}{2\pi}\arctan\left(\frac{X_2}{X_1}\right).$$

Now one can check that the Jacobian of this inverse transformation is, in absolute value,

$$\frac{1}{2\pi}\exp\left(-(x_1^2 + x_2^2)/2\right).$$

Consequently

$$\pi(x_1, x_2) = \frac{1}{2\pi}\exp\left(-(x_1^2 + x_2^2)/2\right) \times 1,$$

which proves the result. This method is simple to implement on a computer,
and is to a certain extent efficient in the sense that two uniformly distributed
random variables $Y_1$ and $Y_2$ give two normally distributed random variables $X_1$
and $X_2$ through the deterministic transformation in Eq. (6). In this sense no
computation is wasted in producing samples that are ultimately rejected. Note
however that this transformation requires the evaluation of log and cos which
can be costly in terms of computer time, and even more efficient alternatives
have been proposed in the literature.

Although apparently limited, this type of transformation can be very useful in
practice to sample from simple distributions that are then fed into more complex
algorithms. Most of the efficient algorithms to sample from gamma’s, beta’s etc.
are a mixture of such deterministic transformations and the accept/reject
method.
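
The Box-Muller transformation of Eq. (6) can be sketched as below. The function name is an illustrative choice; the only liberty taken with respect to the formula is to draw $Y_1$ in $(0, 1]$ instead of $[0, 1)$, so that the logarithm is always finite.

```python
import math
import random


def box_muller(rng):
    """Turn two independent uniform variates into two independent N(0, 1) variates, as in Eq. (6)."""
    y1 = 1.0 - rng.random()  # in (0, 1], which avoids log(0)
    y2 = rng.random()
    radius = math.sqrt(-2.0 * math.log(y1))
    angle = 2.0 * math.pi * y2
    return radius * math.cos(angle), radius * math.sin(angle)


if __name__ == "__main__":
    rng = random.Random(0)  # arbitrary seed
    pairs = [box_muller(rng) for _ in range(50_000)]
    n = len(pairs)
    mean1 = sum(x1 for x1, _ in pairs) / n
    var1 = sum(x1 * x1 for x1, _ in pairs) / n - mean1 ** 2
    cross = sum(x1 * x2 for x1, x2 in pairs) / n
    print(mean1, var1, cross)  # to be compared with 0, 1 and 0
```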

3 MCMC Methods 

3.1 Motivation 

So far we have seen methods of sampling from relatively low dimensional distributions,
which in fact collapse for even modest dimensions. For example consider
the following -over-used- Bayesian example, the nuclear pump data example
(Gaver and O’Muircheartaigh, 1987). This example describes multiple failures
in a nuclear plant with the data, say $y$, given in the following table:

Pump      1      2      3      4       5     6      7     8     9     10
Failures  5      1      5      14      3     19     1     1     4     22
Times     94.32  15.72  62.88  125.76  5.24  31.44  1.05  1.05  2.10  10.48

The modeling is based on the assumption that the failures of the $i$-th pump
follow a Poisson process with parameter $\lambda_i$ ($1 \leq i \leq 10$). For an observed time
$t_i$, the number of failures $p_i$ is thus a Poisson $\mathcal{P}(\lambda_i t_i)$ random variable. The
unknowns here consist therefore of $\theta := (\lambda_1, \ldots, \lambda_{10}, \beta)$ and the aim here is to
estimate quantities related to $p(\theta|y)$. For reasons invoked by the authors one
chooses the following prior distributions,

$$\lambda_i \sim \mathcal{G}a(\alpha, \beta) \quad \text{and} \quad \beta \sim \mathcal{G}a(\gamma, \delta)$$

with $\alpha = 1.8$, $\gamma = 0.01$ and $\delta = 1$. Note that this introduces a hierarchical
parameterization of the problem, as the hyperparameter $\beta$ is considered unknown
here. A prior distribution is therefore ascribed to this hyperparameter, therefore
robustifying the inference. The posterior distribution is proportional to

$$\prod_{i=1}^{10} \left\{ (\lambda_i t_i)^{p_i} \exp(-\lambda_i t_i)\, \lambda_i^{\alpha-1} \exp(-\beta\lambda_i) \right\} \beta^{10\alpha}\, \beta^{\gamma-1} \exp(-\delta\beta)$$

$$\propto \prod_{i=1}^{10} \left\{ \lambda_i^{p_i+\alpha-1} \exp(-(t_i + \beta)\lambda_i) \right\} \beta^{10\alpha+\gamma-1} \exp(-\delta\beta).$$

This multidimensional distribution is rather complex, and it is not obvious
how the inverse cdf method, the rejection method or importance sampling could
be used in this context. However one notices that the following conditionals have
a familiar form,

$$\lambda_i \mid (\beta, t_i, p_i) \sim \mathcal{G}a(p_i + \alpha,\ t_i + \beta) \quad \text{for } 1 \leq i \leq 10,$$

$$\beta \mid (\lambda_1, \ldots, \lambda_{10}) \sim \mathcal{G}a\Big(\gamma + 10\alpha,\ \delta + \sum_{i=1}^{10} \lambda_i\Big), \tag{7}$$

and instead of directly sampling the vector $\theta = (\lambda_1, \ldots, \lambda_{10}, \beta)$ at once, one could
suggest sampling it progressively and iteratively, starting for example with the
$\lambda_i$’s for a given guess of $\beta$, followed by an update of $\beta$ given the new samples
$\lambda_1, \ldots, \lambda_{10}$. More precisely, given a sample, at iteration $t$, $\theta^t := (\lambda_1^t, \ldots, \lambda_{10}^t, \beta^t)$
one could proceed as follows at iteration $t + 1$,

1. $\lambda_i^{t+1} \mid (\beta^t, t_i, p_i) \sim \mathcal{G}a(p_i + \alpha,\ t_i + \beta^t)$ for $1 \leq i \leq 10$,

2. $\beta^{t+1} \mid (\lambda_1^{t+1}, \ldots, \lambda_{10}^{t+1}) \sim \mathcal{G}a\big(\gamma + 10\alpha,\ \delta + \sum_{i=1}^{10} \lambda_i^{t+1}\big)$.

This suggestion is of great interest: indeed instead of directly sampling in
a space with 11 dimensions one samples in spaces of dimension 1, which can
be achieved using either of the methods reviewed in previous sections. However
the structure of the algorithm calls for many questions: by sampling from these
conditional distributions are we sampling from the desired joint distribution? If
yes, how many times should the iteration above be repeated? In fact the validity
of the approach described here stems from the fact that the sequence $\{\theta^t\}$ defined
above is a Markov chain and, as we shall see, some Markov chains have very nice
properties.
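
The two-step iteration above can be sketched directly from the table and the two conditional distributions. This is an illustration under stated assumptions rather than a reference implementation: the function name, the initial guess `beta0`, the number of iterations and the number of initial iterations discarded in the usage example are arbitrary choices of ours. Note that $\mathcal{G}a(a, b)$ is used in the text with $b$ as a rate (the exponent is $-b\lambda$), whereas `random.gammavariate` expects a scale, hence the inversion.

```python
import random

FAILURES = [5, 1, 5, 14, 3, 19, 1, 1, 4, 22]
TIMES = [94.32, 15.72, 62.88, 125.76, 5.24, 31.44, 1.05, 1.05, 2.10, 10.48]
ALPHA, GAMMA, DELTA = 1.8, 0.01, 1.0


def gibbs_pump(n_iter, rng, beta0=1.0):
    """Run the two-step iteration n_iter times; return the list of (lambdas, beta) visited."""
    beta = beta0
    chain = []
    for _ in range(n_iter):
        # step 1: lambda_i | beta ~ Ga(p_i + alpha, rate t_i + beta)
        lambdas = [
            rng.gammavariate(p + ALPHA, 1.0 / (t + beta))
            for p, t in zip(FAILURES, TIMES)
        ]
        # step 2: beta | lambdas ~ Ga(gamma + 10 alpha, rate delta + sum of the lambda_i)
        beta = rng.gammavariate(GAMMA + 10.0 * ALPHA, 1.0 / (DELTA + sum(lambdas)))
        chain.append((lambdas, beta))
    return chain


if __name__ == "__main__":
    rng = random.Random(0)  # arbitrary seed
    chain = gibbs_pump(5_000, rng)
    kept = chain[500:]  # discard the first iterations (arbitrary choice)
    print(sum(beta for _, beta in kept) / len(kept))       # average of beta along the chain
    print(sum(lams[0] for lams, _ in kept) / len(kept))    # average of lambda_1 along the chain
```

Each update only requires sampling from a one-dimensional gamma distribution, for which the techniques of Section 2 apply.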

3.2 Intuitive Approach to MCMC 

Basic Concepts. Assume that we wish to sample from a distribution $\pi$. The
idea of MCMC consists of running an ergodic Markov chain. In order to illustrate
this intuitively, consider Figure 11. The target distribution corresponds to the
continuous line. It is a normal distribution. We consider here 1000 independent
Markov chains run in parallel. We assume that the initial distribution of these
Markov chains is a uniform distribution on [0,20]. We then apply a (specially
designed) Markov transition probability to all of the 1000 samples, independently.
Observe how the histograms of these samples evolve with the
iterations. The normal distribution seems to "attract" the distribution
of the samples and even to be a fixed point of the algorithm. This is what we
wanted to achieve, i.e. it seems that we have produced 1000 independent samples
from the normal distribution. The numbers 1, 2, 3, 4 and 5 correspond to
the location of samples 1, 2, 3, 4 and 5 along the iterations. In fact one can show
that in many situations of interest it is not necessary to run 1000 Markov chains
in parallel in order to obtain 1000 samples, but that one can consider a unique
Markov chain, and build the histogram from this single Markov chain by forming











Fig. 11. From top left to bottom right: histograms of 1000 independent Markov chains 
with a normal distribution as target distribution 

histograms from one trajectory. This idea is illustrated in Figure 12. The target
distribution is here a mixture of normal distributions. Notice that the estimate
of the target distribution, through the series of histograms, improves with the
number of iterations. Assume that we have stored $\{X_i, 1 \le i \le N\}$ for $N$ large
and wish to estimate $\int_X f(x)\pi(x)dx$. In the light of the numerical experiments
above, one can suggest the estimator

$\frac{1}{N} \sum_{i=1}^{N} f(X_i)$,

which is exactly the estimator that we would use if $\{X_i, 1 \le i \le N\}$ were
independent. In fact, it can be proved, under relatively mild conditions, that
such an estimator is consistent even though the samples are not independent:
it converges to $\int_X f(x)\pi(x)dx$ as $N \to +\infty$. Under additional conditions,
a central limit theorem also holds for
this estimator, and the rate of convergence is again $1/\sqrt{N}$. Note however that
the constant involved in the CLT will be different from the constant in the
independent case, as it will take into account the fact that the samples are not
independent.
Fig. 12. Sampling from a mixture of normal distributions following the path of a single
Markov chain. Full line: the target distribution - Dashed line: histogram of the path. 
Top: 1000 iterations only. Bottom: 10000 iterations 



Unfortunately not all Markov chains, with transition probability say $P$, will
have the following three important properties observed above:

1. The desired distribution $\pi$ is a "fixed point" of the algorithm or, in more
appropriate terms, an invariant distribution of the Markov chain, i.e.
$\int_X \pi(x)P(x,y)dx = \pi(y)$.

2. The successive distributions of the Markov chains are "attracted" by $\pi$, or
converge towards $\pi$.

3. The estimator

$\frac{1}{N} \sum_{i=1}^{N} f(X_i)$

is consistent, and converges towards $E_\pi(f(X))$.



The first point is easily solved: the Metropolis-Hastings algorithm provides
us with a mechanism for building Markov chains that admit a given distribution
$\pi$ as invariant distribution, whose density is known up to a normalizing
constant. Note that this latter property is very convenient in a Bayesian
framework! The reason why the Metropolis-Hastings algorithm admits any
desired distribution $\pi$ as invariant distribution stems from the fact that it is
reversible with respect to $\pi$, i.e. for any $x, y \in X$

$\pi(x)P(x,y) = \pi(y)P(y,x)$

and therefore automatically admits $\pi$ as invariant distribution (indeed integrate
the equality above with respect to $x$ over $X$). In order to answer the second and
third points one needs to introduce two notions: irreducibility and aperiodicity.
The notion of reducibility (i.e. non-irreducibility) is illustrated in Figure 13:
the Markov chain cannot reach a region of the space $X$ where the distribution
$\pi$ has positive mass. Therefore irreducibility means that two arbitrarily chosen
points in $X$ with positive densities can always communicate in a finite number
of iterations. It is quite remarkable that under this simple condition, provided
that $\pi$ is an invariant distribution of the Markov chain and $E_\pi(|f(X)|) < +\infty$,
then $N^{-1} \sum_{i=1}^{N} f(X_i)$ is consistent (see [24]). In order to ensure that the
series of distributions of the Markov chain converges it is furthermore necessary
to ensure aperiodicity. To illustrate this, consider the following toy example:
$X = \{1, 2\}$, $P(1,2) = 1$ and $P(2,1) = 1$. One easily checks that

$\pi^T P = \pi^T \begin{pmatrix} 0 & 1 \\ 1 & 0 \end{pmatrix} = \pi^T$

admits the solution $\pi = (1/2, 1/2)^T$, i.e. $\pi$ is an invariant distribution of the
Markov chain. Clearly this chain has a periodic behavior, with period 2, so that
if at iteration $i = 0$ the chain always starts in 1, i.e. $\mu = (1,0)^T$, then the
distributions of the Markov chain are

$\mu^T P^{2k} = \mu^T = (1, 0)$,   $\mu^T P^{2k+1} = (0, 1)$,

that is the distributions do not converge. On the other hand the proportions of
time spent in state 1 and 2 converge to 1/2, 1/2 and we expect $N^{-1} \sum_{i=1}^{N} f(X_i)$
to be consistent.
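
The toy example is small enough to check numerically. The sketch below (variable names are ours) verifies that $\pi = (1/2, 1/2)$ is invariant, shows that the marginal distributions keep oscillating, and computes the fraction of time spent in each state along one trajectory.

```python
import numpy as np

P = np.array([[0.0, 1.0],
              [1.0, 0.0]])      # deterministic flip between states 1 and 2
pi = np.array([0.5, 0.5])
mu = np.array([1.0, 0.0])       # chain started in state 1

# pi is invariant: pi^T P = pi^T.
is_invariant = np.allclose(pi @ P, pi)

# The marginal distributions mu^T P^n alternate between (1, 0) and (0, 1).
marginals = [mu @ np.linalg.matrix_power(P, n) for n in range(6)]

# The proportion of time spent in each state along one path does converge.
N = 1000
state, visits = 0, np.zeros(2)
for _ in range(N):
    visits[state] += 1
    state = 1 - state           # the only possible move
occupation = visits / N
```

Invariance and convergence of time averages hold here, while convergence of the marginals fails, which is exactly the role played by aperiodicity.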
Fig. 13. In this case the Markov chain cannot explore the complete distribution: this
is an illustration of reducibility (or in fact here quasi-reducibility) 

The Gibbs Sampler. In the light of the appendix on Markov chains, one can
ask if the following algorithm is likely to produce samples from the required
posterior distribution,

$\lambda_i \mid (\beta, t_i, p_i) \sim \mathcal{G}a(p_i + \alpha,\ t_i + \beta)$ for $1 \le i \le 10$

$\beta \mid (\lambda_1, \ldots, \lambda_{10}) \sim \mathcal{G}a\big(\gamma + 10\alpha,\ \delta + \sum_{i=1}^{10} \lambda_i\big)$.

There are many ways of sampling from these unidimensional distributions
(including rejection sampling, but there are much more efficient ways). The
idea of the Gibbs sampler consists of replacing a difficult global update of $\theta$
with successive updates of the components of $\theta$ (or in fact in general groups of
components of $\theta$). Given the simple and familiar expressions of the conditional
distributions above, one can suggest the following algorithm

1. $\lambda_i^{t+1} \mid (\beta^t, t_i, p_i) \sim \mathcal{G}a(p_i + \alpha,\ t_i + \beta^t)$ for $1 \le i \le 10$,

2. $\beta^{t+1} \mid (\lambda_1^{t+1}, \ldots, \lambda_{10}^{t+1}) \sim \mathcal{G}a\big(\gamma + 10\alpha,\ \delta + \sum_{i=1}^{10} \lambda_i^{t+1}\big)$.

Maybe surprisingly, this algorithm produces samples from the posterior
distribution $p(\theta|y)$, provided that the required distribution is invariant and
that the Markov chain is irreducible and aperiodic. We start with a result
in a simple case. The generalization is straightforward.

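Proposition 1. Let $(a, b) \sim p(a, b)$. Then the Markov chain that updates $(a, b)$ by
sampling successively from the conditional distributions $p(a|b)$ and $p(b|a)$
admits $p(a, b)$ as invariant distribution.

From the definition of invariance, we want to prove that for any $a', b'$,

$\int_X \int_X p(a,b)\, p(a'|b)\, p(b'|a')\, da\, db = p(a', b')$.

We start from the left hand side, and apply basic probability rules:

$\int_X \int_X p(a,b)\, p(a'|b)\, p(b'|a')\, da\, db = \int_X p(b)\, p(a'|b)\, p(b'|a')\, db$

$= \int_X p(a', b)\, p(b'|a')\, db$

$= \int_X p(b|a')\, p(a')\, p(b'|a')\, db$

$= p(a', b') \times 1$.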



Now, in order to ensure the convergence of estimators of the type
$N^{-1} \sum_{i=1}^{N} f(X_i)$, it is sufficient to ensure irreducibility. This is not
automatically verified for a Gibbs sampler, as illustrated in Figure 14 with a simple example.
However in the nuclear pumps failure data, irreducibility is automatic: all the
conditional distributions are strictly positive on the domain of definition of the
parameters ($(0, +\infty)$ for each of them). One can therefore reach any set $A$ from
any starting point $x$ with positive probability in one iteration of the Gibbs
sampler.




Fig. 14. A distribution that can lead to a reducible Gibbs sampler 



It is relatively easy to prove aperiodicity as well, but we will not dwell
on this here, as we are in practice mostly interested in estimators of the type
$N^{-1} \sum_{i=1}^{N} f(X_i)$.

Although natural and generally easy to implement, the Gibbs sampler does not
come without problems. First it is clear that it requires one to be able to
identify conditional distributions in the model from which it is routine to sample.
This is in fact rarely the case with realistic models. It is however generally the
case when distributions from an exponential family are involved in the
modeling. Another problem of the Gibbs sampler is that its speed of convergence
is directly influenced by the correlation properties of the target distribution
$\pi$. Indeed, consider the toy two-dimensional example in Figure 15. This is a
bidimensional normal distribution with strong correlation between $x$ and $y$. A
Gibbs sampler along the $x$ and $y$ axes will require many iterations to go from
one point to another point that is far apart, and is somehow strongly
constrained by the properties (both in terms of shape and algebraic properties)
of $\pi$.

In contrast the Metropolis-Hastings algorithm, which is presented in the next
subsection, possesses an extra degree of freedom, its proposal distribution, which
will determine how $\pi$ is explored. This is illustrated in Figure 16, where for a
good choice of the proposal distribution, the distribution $\pi$ is better explored
than in Figure 15, for the same number of iterations.
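
The effect of correlation on the Gibbs sampler can be observed on a small experiment. The sketch below assumes a zero-mean, unit-variance bivariate normal target with correlation `rho` (our own choice of example, in the spirit of Figure 15), for which both conditionals are normal: $x \mid y \sim N(\rho y, 1 - \rho^2)$ and symmetrically for $y \mid x$. The function names are hypothetical.

```python
import numpy as np

def gibbs_bivariate_normal(rho, n_iter, seed=0):
    """Gibbs sampler updating x and y in turn; returns the x-coordinates."""
    rng = np.random.default_rng(seed)
    s = np.sqrt(1.0 - rho ** 2)          # conditional standard deviation
    x, y = 0.0, 0.0
    xs = np.empty(n_iter)
    for i in range(n_iter):
        x = rho * y + s * rng.standard_normal()   # x | y ~ N(rho*y, 1-rho^2)
        y = rho * x + s * rng.standard_normal()   # y | x ~ N(rho*x, 1-rho^2)
        xs[i] = x
    return xs

def lag1_autocorr(v):
    """Empirical correlation between successive values of the chain."""
    v = v - v.mean()
    return float(np.dot(v[:-1], v[1:]) / np.dot(v, v))

weak = lag1_autocorr(gibbs_bivariate_normal(0.1, 20000))
strong = lag1_autocorr(gibbs_bivariate_normal(0.99, 20000))
```

Substituting one update into the other shows that the $x$-coordinates follow an autoregression with coefficient $\rho^2$, so successive samples are almost identical when $\rho$ is close to 1 and the chain moves across the distribution very slowly.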




Fig. 15. A distribution for which the Gibbs sampler along the x and y axis might be 
very slow 



The Metropolis-Hastings Algorithm. Let $\pi$ be the density of a probability
distribution on $X$ and let $\{q(\theta, \cdot),\ \theta \in X\}$ be a family of probability densities
from which it is possible to sample. The Metropolis-Hastings algorithm proceeds
as follows.

Fig. 16. A distribution for which the Gibbs sampler might be very slow, but here
explored with an appropriate Metropolis-Hastings algorithm



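Metropolis-Hastings Algorithm

1. Initialization, $i = 0$. Set $\theta_0$ randomly or deterministically.

2. Iteration $i$, $i \ge 1$.

— Propose a candidate $\theta \sim q(\theta_{i-1}, \cdot)$.

— Evaluate the acceptance probability

$\alpha(\theta_{i-1}, \theta) = \min\left\{1, \frac{\pi(\theta)/q(\theta_{i-1}, \theta)}{\pi(\theta_{i-1})/q(\theta, \theta_{i-1})}\right\}$   (8)

— Then $\theta_i = \theta$ with probability $\alpha(\theta_{i-1}, \theta)$, otherwise $\theta_i = \theta_{i-1}$.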



Let us assume that we want to simulate a set of samples from $p(\theta|y)$.
Using Bayes' theorem we have $p(\theta|y) \propto p(y|\theta)p(\theta)$. A MH procedure consists
of simulating some candidates $\theta'$ according to $q(\theta, \theta')$, evaluating some quantities

$\alpha(\theta, \theta') = \min\left\{1, \frac{p(y|\theta')\,p(\theta')\,q(\theta', \theta)}{p(y|\theta)\,p(\theta)\,q(\theta, \theta')}\right\}$

and accepting these candidates with probability $\alpha(\theta, \theta')$.



As pointed out earlier, $q$ is to a certain extent an extra degree of freedom
compared to the Gibbs sampler and there are infinitely many possible choices
for $q$. We here briefly review two classical choices.






Random Walk: A simple choice consists of proposing as candidate a perturbation
of the current state, i.e. $\theta' = \theta + z$ where $z$ is a random increment of density
$\varphi$.

— This algorithm corresponds to the particular case $q(\theta, \theta') = \varphi(\theta' - \theta)$. We
obtain the following acceptance probability:

$\alpha(\theta, \theta') = \min\left\{1, \frac{\pi(\theta')\,\varphi(\theta - \theta')}{\pi(\theta)\,\varphi(\theta' - \theta)}\right\}$   (9)

— If $q(\theta, \theta') = \varphi(\theta - \theta') = \varphi(\theta' - \theta)$ then we obtain

$\alpha(\theta, \theta') = \min\left\{1, \frac{\pi(\theta')}{\pi(\theta)}\right\}$   (10)

This algorithm is called the Metropolis algorithm [15].



Independent Metropolis-Hastings: In this case, we select the candidate
independently of the current state according to a distribution $\varphi(\theta')$. Thus
$q(\theta, \theta') = \varphi(\theta')$ and we obtain the following acceptance probability:

$\alpha(\theta, \theta') = \min\left\{1, \frac{\pi(\theta')\,\varphi(\theta)}{\pi(\theta)\,\varphi(\theta')}\right\}$   (11)

In the case where $\pi(\theta)/\varphi(\theta)$ is bounded, i.e. we could also apply the
accept/reject procedure, this procedure shows (fortunately) better asymptotic
performance in terms of variance of ergodic averages.

In a Bayesian framework, if we want to sample from $p(\theta|y) \propto p(y|\theta)p(\theta)$
then one can take $p(\theta)$ as candidate distribution. Then the acceptance
reduces to

$\alpha(\theta, \theta') = \min\left\{1, \frac{p(y|\theta')}{p(y|\theta)}\right\}$   (12)

There are many possible variations on this theme, see [24] and [2].
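
As an illustration of the acceptance rule (8) and of the random walk proposal, here is a minimal sketch. The function and argument names (`metropolis_hastings`, `log_pi`, `propose`, `log_q`) are hypothetical, the target (an equal mixture of two unit-variance normals) and the step size are arbitrary choices for the example, and densities are handled on the log scale for numerical convenience.

```python
import numpy as np

def metropolis_hastings(log_pi, theta0, propose, log_q, n_iter, seed=0):
    """Generic MH sampler for a scalar parameter.
    log_pi(theta): log target density, up to an additive constant.
    propose(rng, theta): draws a candidate from q(theta, .).
    log_q(a, b): log density of proposing b when the chain is at a."""
    rng = np.random.default_rng(seed)
    theta = theta0
    chain = np.empty(n_iter)
    n_accept = 0
    for i in range(n_iter):
        cand = propose(rng, theta)
        # Logarithm of the ratio appearing in (8).
        log_ratio = (log_pi(cand) - log_q(theta, cand)) \
                    - (log_pi(theta) - log_q(cand, theta))
        if np.log(rng.uniform()) < min(0.0, log_ratio):
            theta = cand                 # accept the candidate
            n_accept += 1
        chain[i] = theta                 # on rejection the state is repeated
    return chain, n_accept / n_iter

# Hypothetical target: equal mixture of N(-3, 1) and N(3, 1), unnormalised.
def log_pi(th):
    return np.logaddexp(-0.5 * (th + 3.0) ** 2, -0.5 * (th - 3.0) ** 2)

step = 2.5                               # random walk scale (arbitrary)
chain, acc = metropolis_hastings(
    log_pi, 0.0,
    propose=lambda rng, th: th + step * rng.standard_normal(),
    log_q=lambda a, b: -0.5 * ((b - a) / step) ** 2,   # symmetric proposal
    n_iter=20000)
```

Because the Gaussian increment is symmetric, the two `log_q` terms cancel and the rule reduces to (10). An independent proposal is obtained by making `propose` and `log_q` ignore the current state. Only ratios of $\pi$ are used, so the normalizing constant is never needed.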



Metropolis-Hastings One-at-a-Time. It should not be surprising if the
problems encountered with classical sampling techniques are also problems with the
plain MH algorithm. In particular, when $\theta$ is high-dimensional, it typically
becomes very difficult to select a good proposal distribution: either the acceptance
probability is very low or very large and the chain does not explore $\pi$ very
rapidly, or the chain explores only one mode of the distribution. To solve this
problem one can use the strategy adopted by the Gibbs sampler. Define a
partition of $\theta := (\theta_1, \ldots, \theta_p)$. Then each component $\theta_k$ can be updated according
to a MH update with proposal distribution, say $q_k$, which admits the conditional
distribution $\pi(\theta_k | \theta_{-k})$ (where $\theta_{-k} := (\theta_1, \ldots, \theta_{k-1}, \theta_{k+1}, \ldots, \theta_p)$) as invariant
distribution.

MH One-at-a-Time

1. Initialization, $i = 0$. Set randomly or deterministically $\theta^{(0)} = \theta_0$.

2. Iteration $i$, $i \ge 1$.

— For $k = 1$ to $p$

— Sample $\theta_k^{(i)}$ according to a MH step with proposal distribution

$q_k\big((\theta_{-k}^{(i)}, \theta_k^{(i-1)}), \theta_k\big)$   (13)

and invariant distribution $\pi(\theta_k | \theta_{-k}^{(i)})$, where
$\theta_{-k}^{(i)} := (\theta_1^{(i)}, \ldots, \theta_{k-1}^{(i)}, \theta_{k+1}^{(i-1)}, \ldots, \theta_p^{(i-1)})$.

End For.

This algorithm includes the Gibbs sampler as a special case. Indeed, the
latter corresponds to the particular case where the proposal distributions of the MH
steps are equal to the full conditional distributions,
$q_k\big((\theta_{-k}^{(i)}, \theta_k^{(i-1)}), \theta_k\big) = \pi(\theta_k | \theta_{-k}^{(i)})$,
so that the acceptance probabilities are equal to 1 and no candidate
is rejected.



Theoretical Aspects of the MH Algorithm. In this subsection we establish
that the MH transition probability admits $\pi$ as invariant distribution, and then
briefly discuss the irreducibility and aperiodicity issues. The transition
probability of the Metropolis-Hastings algorithm is, for $x \in X$ and $A \in \mathcal{B}(X)$,

$P(x, A) = \int_A \alpha(x,y)\,q(x,y)\,dy + 1_A(x) \int_X (1 - \alpha(x,y))\,q(x,y)\,dy$

$= \int_A \alpha(x,y)\,q(x,y)\,dy + 1_A(x)\Big[1 - \int_X \alpha(x,y)\,q(x,y)\,dy\Big]$.

We now prove that $P$ is reversible with respect to $\pi$. First notice that

$\alpha(x,y)\,\pi(x)\,q(x,y) = \min\left\{1, \frac{\pi(y)\,q(y,x)}{\pi(x)\,q(x,y)}\right\} \pi(x)\,q(x,y)$

$= \min\{\pi(x)\,q(x,y),\ \pi(y)\,q(y,x)\}$

$= \pi(y)\,q(y,x) \min\left\{\frac{\pi(x)\,q(x,y)}{\pi(y)\,q(y,x)}, 1\right\}$

$= \pi(y)\,q(y,x)\,\alpha(y,x)$.

Consequently for any $A, B \in \mathcal{B}(X)$,

$\int_B \pi(x)\,P(x, A)\,dx = \int_B \int_A \pi(x)\,\alpha(x,y)\,q(x,y)\,dy\,dx + \int_B 1_A(x)\,\pi(x)\Big[1 - \int_X \alpha(x,y)\,q(x,y)\,dy\Big]dx$

$= \int_A \int_B \pi(y)\,q(y,x)\,\alpha(y,x)\,dx\,dy + \int_X 1_{A \cap B}(x)\,\pi(x)\Big[1 - \int_X \alpha(x,y)\,q(x,y)\,dy\Big]dx$

$= \int_A \int_B \pi(y)\,q(y,x)\,\alpha(y,x)\,dx\,dy + \int_A 1_B(x)\,\pi(x)\Big[1 - \int_X \alpha(x,y)\,q(x,y)\,dy\Big]dx$

$= \int_A \pi(y)\,P(y, B)\,dy$.

A simple condition which ensures the irreducibility and the aperiodicity of the
MH algorithm is that $q(x,y)$ is continuous and strictly positive on the support
of $\pi$ for any $x$ [20].
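
The reversibility argument can be checked numerically on a finite state space, where integrals become sums and $P$ is a matrix. The sketch below uses an arbitrary randomly generated target and proposal (both are assumptions made for the illustration) and builds the MH transition matrix explicitly.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 5
pi = rng.random(n)
pi /= pi.sum()                                   # arbitrary target on n states
Q = rng.random((n, n)) + 0.1                     # strictly positive proposal
Q /= Q.sum(axis=1, keepdims=True)                # each row sums to one

# alpha[x, y] = min{1, pi(y) q(y, x) / (pi(x) q(x, y))}
ratio = (pi[None, :] * Q.T) / (pi[:, None] * Q)
alpha = np.minimum(1.0, ratio)

# MH kernel: accepted moves off the diagonal, remaining mass on the diagonal.
P = Q * alpha
np.fill_diagonal(P, 0.0)
P[np.diag_indices(n)] = 1.0 - P.sum(axis=1)

flow = pi[:, None] * P                           # flow[x, y] = pi(x) P(x, y)
reversible = np.allclose(flow, flow.T)           # detailed balance
invariant = np.allclose(pi @ P, pi)              # pi P = pi
```

Summing the detailed balance identity over $x$ gives invariance, which is the discrete counterpart of the integration step mentioned earlier in the text.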
Abstract. This contribution presents an overview of the theoretical and
practical aspects of the broad family of learning algorithms based on 
Stochastic Gradient Descent, including Perceptrons, Adalines, K-Means, 
LVQ, Multi-Layer Networks, and Graph Transformer Networks. 



1 Introduction 

This contribution reviews some material presented during the “Stochastic Learn- 
ing” lecture given at the 2003 Machine Learning Summer School in Tubingen. It 
defines a broad family of learning algorithms that can be formalized as stochas- 
tic gradient descent algorithms and describes their common properties. This in- 
cludes numerous well known algorithms such as Perceptrons, Adalines, K-Means, 
LVQ, and Multi-Layer Networks. 

Stochastic learning algorithms are also effective for training large systems 
with rich structure, such as Graph Transformer Networks [8, 24]. Such large 
scale systems have been designed and industrially deployed with considerable 
success. 

— Section 2 presents the basic framework and illustrates it with a number of 
well known learning algorithms. 

— Section 3 presents the basic mathematical tools for establishing the conver- 
gence of stochastic learning algorithms. 

— Section 4 discusses the learning speed of stochastic learning algorithms ap- 
plied to large datasets. This discussion covers both statistical efficiency and 
computational requirements. 

These concepts were previously discussed in [9, 10, 14, 12]. Readers
interested in the practice of stochastic gradient algorithms should also read [25] and
investigate applied contributions such as [39, 37, 46, 6, 24, 26].



2 Foundations 

Almost all of the early work on automatic learning systems focused on online algorithms
[18, 34, 44, 2, 19]. In these early days, the algorithmic simplicity of online
algorithms was a requirement. This is still the case when it comes to handling large,
real-life training sets [23, 30, 25, 26].

The early recursive adaptive algorithms were introduced during the same
years [33] and very often by the same people [45]. First developed in the
engineering world, recursive adaptation algorithms have turned into a mathematical
discipline, namely stochastic approximations [22, 27, 7].

2.1 Expected Risk Function 

In [40, 41], the goal of a learning system consists of finding the minimum of a
function $C(w)$ named the expected risk function. This function is decomposed
as follows:

$C(w) = E_z\, Q(z, w) = \int Q(z, w)\, dP(z)$   (1)

The minimization variable $w$ is meant to represent the part of the learning
system which must be adapted as a response to observing events $z$ occurring
in the real world. The loss function $Q(z, w)$ measures the performance of the
learning system with parameter $w$ under the circumstances described by event $z$.
Common mathematical practice suggests representing both $w$ and $z$ by elements
of adequately chosen spaces $W$ and $Z$.

2.2 Gradient Based Learning 

The expected risk function (1) cannot be minimized directly because the ground
truth distribution is unknown. It is however possible to compute an
approximation of $C(w)$ by simply using a finite training set of independent observations
$z_1, \ldots, z_L$.

$C(w) \approx C_L(w) = \frac{1}{L} \sum_{n=1}^{L} Q(z_n, w)$   (2)

General theorems [42] show that minimizing the empirical risk $C_L(w)$ can
provide a good estimate of the minimum of the expected risk $C(w)$ when the
training set is large enough. This line of work has provided a way to understand
the generalization phenomenon, i.e. the ability of a system to learn from a finite
training set and yet provide results that are valid in general.

Batch Gradient Descent. Minimizing the empirical risk $C_L(w)$ can be achieved
using a batch gradient descent algorithm. Successive estimates $w_t$ of the optimal
parameter are computed using the following formula

$w_{t+1} = w_t - \gamma_t \nabla_w C_L(w_t) = w_t - \gamma_t \frac{1}{L} \sum_{i=1}^{L} \nabla_w Q(z_i, w_t)$   (3)

where the learning rate $\gamma_t$ is a positive number.



¹ The origin of this statistical framework is unclear. It has been popularized by
Vapnik's work [42] but was already discussed in Tsypkin's work [40] or even [16]. Vapnik
told me that "someone wrote this on the blackboard during a seminar"; he does not
remember who did.
The properties of this optimization algorithm are well known: when the
learning rates $\gamma_t$ are small enough², the algorithm converges towards a local
minimum of the empirical risk $C_L(w)$. Each iteration of the batch gradient descent
algorithm however involves a burdensome computation of the average of the
gradients of the loss function $\nabla_w Q(z_n, w)$ over the entire training set. Significant
computer resources must be allocated in order to store a large enough training
set and compute this average.
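
The batch update (3) can be sketched as follows. The loss used here, $Q(z, w) = \frac{1}{2}(y - w'x)^2$ with $z = (x, y)$, the synthetic data, the constant learning rate and the names `grad_Q` and `batch_gradient_descent` are all assumptions made for the illustration.

```python
import numpy as np

def grad_Q(w, x, y):
    """Gradient in w of the illustrative loss Q(z, w) = 0.5 * (y - w'x)^2."""
    return -(y - x @ w) * x

def batch_gradient_descent(X, Y, gamma, n_iter):
    w = np.zeros(X.shape[1])
    for _ in range(n_iter):
        # Average the per-example gradients over the whole training set, as in (3).
        g = np.mean([grad_Q(w, x, y) for x, y in zip(X, Y)], axis=0)
        w = w - gamma * g
    return w

rng = np.random.default_rng(0)
w_true = np.array([2.0, -1.0])                 # hypothetical ground truth
X = rng.standard_normal((200, 2))
Y = X @ w_true + 0.1 * rng.standard_normal(200)
w_batch = batch_gradient_descent(X, Y, gamma=0.1, n_iter=300)
```

Every iteration visits all 200 examples before the parameter moves once, which is the cost the paragraph above refers to.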

Online Gradient Descent. The elementary online gradient descent algorithm
is obtained by dropping the averaging operation in the batch gradient descent
algorithm (3). Instead of averaging the gradient of the loss over the complete
training set, each iteration of the online gradient descent consists of choosing an
example $z_t$ at random, and updating the parameter $w_t$ according to the following
formula.

$w_{t+1} = w_t - \gamma_t \nabla_w Q(z_t, w_t)$   (4)

Averaging this update over all possible choices of the training example $z_t$
would restore the batch gradient descent algorithm. The online gradient descent
simplification relies on the hope that the random noise introduced by this
procedure will not perturb the average behavior of the algorithm. Significant
empirical evidence substantiates this hope.
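
A matching sketch of the online update (4) follows, again under our own assumptions: the same illustrative loss $Q(z, w) = \frac{1}{2}(y - w'x)^2$, synthetic data, and a decreasing learning rate schedule chosen for the example rather than taken from the text.

```python
import numpy as np

def online_gradient_descent(X, Y, n_iter, gamma0=0.1, seed=1):
    rng = np.random.default_rng(seed)
    w = np.zeros(X.shape[1])
    for t in range(n_iter):
        i = rng.integers(len(X))               # choose one example at random
        x, y = X[i], Y[i]
        gamma_t = gamma0 / (1.0 + 0.01 * t)    # decreasing rate (assumed schedule)
        # Update (4): minus the gradient of 0.5 * (y - w'x)^2 is (y - w'x) x.
        w = w + gamma_t * (y - x @ w) * x
    return w

rng = np.random.default_rng(0)
w_true = np.array([2.0, -1.0])                 # hypothetical ground truth
X = rng.standard_normal((200, 2))
Y = X @ w_true + 0.1 * rng.standard_normal(200)
w_online = online_gradient_descent(X, Y, n_iter=5000)
```

Each step touches a single example, so 5000 updates cost about as much as 25 sweeps over this training set.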







Fig. 1. Online Gradient Descent. The parameters of the learning system are updated 
using information extracted from real world observations 



Many variants of (4) have been defined. Parts of this contribution discuss two
significant variants: Section 2.4 replaces the gradient $\nabla_w Q(z, w)$ by a general
term $U(z, w)$ satisfying $E_z U(z, w) = \nabla_w C(w)$. Section 4 replaces the learning
rates $\gamma_t$ by positive symmetric matrices (equation (27)).

Online gradient descent can also be described without reference to a training
set. Instead of drawing examples from a training set, we can directly use the
events $z_t$ observed in the real world, as shown in Figure 1. This formulation
is particularly adequate for describing adaptive algorithms that simultaneously
process an observation and learn to perform better. Such adaptive algorithms
are very useful for tracking a phenomenon that evolves in time.

² Convergence occurs for constant learning rates, smaller than a critical learning rate
related to the maximal curvature of the cost function. See [25] for instance.

Formulating online gradient descent without reference to a training set is also
of theoretical interest. Each iteration of the algorithm uses an example
$z_t$ drawn from the ground truth distribution instead of a finite training set. The
average update therefore is a gradient descent algorithm which directly
optimizes the expected risk. This shortcuts the usual discussion about differences
between optimizing the empirical risk and the expected risk [42, 43]. Proving
the convergence of an online algorithm towards the minimum of the expected
risk provides an alternative to the Vapnik proofs of the consistency of learning
algorithms. Non-asymptotic bounds for online algorithms are rare.

2.3 Examples: Online Least Mean Squares 

Widrow's Adaline. The adaline [44] is one of the few learning systems
designed at the very beginning of the computer age. Online gradient descent was
then a very attractive proposition requiring little hardware. The adaline could fit
in a refrigerator sized cabinet containing a forest of potentiometers and electrical
motors.

The Adaline (Figure 2) learning algorithm adapts the parameters of a single
threshold unit. Input patterns $x$ are recognized as class $y = +1$ or $y = -1$
according to the sign of $w'x + \beta$. It is practical to consider an augmented
pattern $x$ containing an extra constant coefficient equal to 1. The bias $\beta$
then is represented as an extra coefficient in the parameter vector $w$. With this
convention, the output of the threshold unit can be written as

$\hat{y}_w(x) = \mathrm{sign}(w'x) = \mathrm{sign} \sum_i w_i x_i$   (5)




Fig. 2. Widrow’s Adaline. The adaline computes a binary indicator by thresholding a 
linear combination of its input. Learning is achieved using the delta rule 

During training, the Adaline is provided with pairs $z = (x, y)$ representing
input patterns and desired output for the Adaline. The parameter $w$ is adjusted
using the delta rule (the "prime" denotes transposed vectors):

$w_{t+1} = w_t + \gamma_t (y_t - w_t' x_t)\, x_t$   (6)
This delta rule is nothing more than an iteration of the online gradient descent
algorithm (4) with the following loss function (the factor 2 of the gradient being
absorbed in the learning rate):

$Q_{adaline}(z, w) = (y - w'x)^2$   (7)

This loss function does not take the discontinuity of the threshold unit (5)
into account. This linear approximation is a real breakthrough over the
apparently more natural loss function $(y - \hat{y}_w(x))^2$. This discontinuous loss function
is difficult to minimize because its gradient is zero almost everywhere.
Furthermore, all solutions achieving the same misclassification rate would have the same
cost $C(w)$, regardless of the margins separating the examples from the decision
boundary implemented by the threshold unit.
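
A minimal sketch of the delta rule (6) on augmented patterns is given below. The two-class data, the constant learning rate, the number of sweeps and the function names are hypothetical choices made for the illustration. Examples are visited in random order within each sweep.

```python
import numpy as np

def adaline_train(X, y, gamma=0.01, n_epochs=20, seed=0):
    """Delta rule (6); y contains +1 / -1 labels."""
    rng = np.random.default_rng(seed)
    Xa = np.hstack([X, np.ones((len(X), 1))])   # augmented patterns (bias term)
    w = np.zeros(Xa.shape[1])
    for _ in range(n_epochs):
        for i in rng.permutation(len(Xa)):
            # The error is measured on the linear output w'x, not on its sign.
            w = w + gamma * (y[i] - Xa[i] @ w) * Xa[i]
    return w

def adaline_predict(w, X):
    Xa = np.hstack([X, np.ones((len(X), 1))])
    return np.sign(Xa @ w)                       # threshold unit (5)

# Hypothetical, well separated two-class data.
rng = np.random.default_rng(1)
X = np.vstack([rng.normal(+2.0, 1.0, (100, 2)),
               rng.normal(-2.0, 1.0, (100, 2))])
y = np.concatenate([np.ones(100), -np.ones(100)])
w = adaline_train(X, y)
accuracy = np.mean(adaline_predict(w, X) == y)
```

The threshold only appears at prediction time. Training uses the smooth quadratic loss (7), which is why the gradient is informative everywhere.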



Multi-layer Networks. Multi-Layer Networks were initially designed to
overcome the computational limitation of the threshold units [29]. Arbitrary binary
mappings can be implemented by stacking several layers of threshold units, each
layer using the outputs of the previous layers as inputs. The Adaline linear
approximation could not be used in this framework, because ignoring the
discontinuities would make the entire system linear regardless of the number of
layers. The key of a learning algorithm for multi-layer networks [35] consisted of
noticing that the discontinuity of the threshold unit could be represented by a
smooth non-linear approximation.

$\mathrm{sign}(w'x) \approx \tanh(w'x)$   (8)

Using such sigmoidal units does not reduce the computational capabilities of a
multi-layer network, because the approximation of a step function by a sigmoid
can be made arbitrarily good by scaling the coefficients of the parameter vector
w.
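
The effect of scaling the parameter vector in (8) can be seen numerically. In the sketch below the values standing for $w'x$ and the scaling factors are arbitrary, and points exactly at zero are left out since the sign function jumps there.

```python
import numpy as np

s = np.array([-2.0, -0.5, -0.1, 0.1, 0.5, 2.0])   # values of w'x away from zero
gaps = []
for scale in (1.0, 10.0, 100.0):
    # Multiplying w by `scale` multiplies w'x by the same factor.
    gap = np.max(np.abs(np.tanh(scale * s) - np.sign(s)))
    gaps.append(gap)
    print(f"scale {scale:6.1f}: largest gap to sign = {gap:.2e}")
```

The gap shrinks quickly as the scale grows, at every point bounded away from zero.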

A multi-layer network of sigmoidal units implements a differentiable function
$f(x, w)$ of the input pattern $x$ and the parameters $w$. Given an input pattern $x$
and the desired network output $y$, the back-propagation algorithm [35] provides
an efficient way to compute the gradients of the mean square loss function.

$Q_{mse}(z, w) = \frac{1}{2} (y - f(x, w))^2$   (9)



Both the batch gradient descent (3) and the online gradient descent (4) have
been used with considerable success. On large, redundant data sets, the online
version converges much faster than the batch version, sometimes by orders of
magnitude [30]. An intuitive explanation can be found in the following extreme
example. Consider a training set composed of two copies of the same subset.
The batch algorithm (3) averages the gradient of the loss function over the
whole training set, causing redundant computations. On the other hand, running
online gradient descent (4) on all examples of the training set would amount to
performing two complete learning iterations over the duplicated subset.
2.4 Examples: Non Differentiable Loss Functions

Many interesting examples involve a loss function $Q(z, w)$ which is not
differentiable on a subset of points with probability zero. Intuition suggests that this is
a minor problem because the iterations of the online gradient descent have zero
probability to reach one of these points. Even if we reach one of these points, we
can just draw another example $z$.

This can be formalized as replacing the gradient $\nabla_w Q(z, w)$ in equation (4)
by an update term $U(z, w)$ defined as follows:

$U(z, w) = \begin{cases} \nabla_w Q(z, w) & \text{when differentiable} \\ 0 & \text{otherwise} \end{cases}$   (10)



The convergence study (Section 3) shows that this works if the expectation
of the update term $U(z, w)$ is equal to the gradient of the cost $C(w)$:

$E_z U(z, w) = \nabla_w C(w) \iff \int \nabla_w Q(z, w)\, dP(z) = \nabla_w \int Q(z, w)\, dP(z)$   (11)



The Lebesgue integration theory provides a sufficient condition for
swapping the integration ($\int$) and differentiation ($\nabla_w$) operators as in (11). For each
parameter value $w$ reached by the online algorithm, it is sufficient to find an
integrable function $\Phi(z, w)$ and a neighborhood $\vartheta(w)$ of $w$ such that:

$\forall z,\ \forall v \in \vartheta(w),\quad |Q(z, v) - Q(z, w)| \le |w - v|\, \Phi(z, w)$   (12)

This condition (12) tests that the maximal slope of the loss function $Q(z, w)$
is conveniently bounded. This is obviously true when the loss function $Q(z, w)$
is differentiable and has an integrable gradient. This is obviously false when the
loss function is not continuous. Given our previous assumption concerning the
zero probability of the non differentiable points, condition (12) is a sufficient
condition for safely ignoring a few non differentiable points.



Rosenblatt's Perceptron. During the early days of the computer age, the
perceptron [34] generated considerable interest as a possible architecture for
general purpose computers. This interest faded after the disclosure of its
computational limitations [29]. Figure 3 represents the perceptron architecture. An
associative area produces a feature vector $x$ by applying predefined
transformations to the input. The feature vector is then processed by a threshold element
(cf. Adaline).

The perceptron learning algorithm adapts the parameters $w$ of the threshold
unit. Whenever a misclassification occurs, the parameters are updated according
to the perceptron rule:

$w_{t+1} = w_t + 2 \gamma_t\, y_t\, x_t$   (13)

This learning rule can be derived as an online gradient descent applied to the
following loss function:

$Q_{perceptron}(z, w) = (\mathrm{sign}(w'x) - y)\, w'x$   (14)
[Figure: an associative area followed by a threshold element computing sign(w'x)]

Fig. 3. Rosenblatt's Perceptron is composed of a fixed preprocessing and of a trainable
threshold unit
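
The perceptron rule (13) can be sketched as follows. The separable data set (labels given by the sign of $x_1 + x_2$, with points near the boundary removed), the constant learning rate and the name `perceptron_train` are assumptions made for the illustration. The feature vector is augmented with a constant coefficient, as for the Adaline.

```python
import numpy as np

def perceptron_train(X, y, gamma=0.5, max_epochs=100):
    """Perceptron rule (13): the weights move only on misclassified patterns."""
    w = np.zeros(X.shape[1])
    for _ in range(max_epochs):
        mistakes = 0
        for x, target in zip(X, y):
            if np.sign(w @ x) != target:      # sign(0) = 0 is treated as a mistake
                w = w + 2.0 * gamma * target * x
                mistakes += 1
        if mistakes == 0:                      # every pattern is recognized
            break
    return w

# Hypothetical separable data with a margin around the decision boundary.
rng = np.random.default_rng(0)
pts = rng.uniform(-1.0, 1.0, size=(300, 2))
pts = pts[np.abs(pts.sum(axis=1)) > 0.5]
X = np.hstack([pts, np.ones((len(pts), 1))])   # feature vector plus constant
y = np.sign(pts.sum(axis=1))
w = perceptron_train(X, y)
accuracy = np.mean(np.sign(X @ w) == y)
```

On a misclassified pattern, minus the gradient of the loss (14) is $(y - \mathrm{sign}(w'x))\,x = 2 y x$, which matches the step taken in the loop. On correctly classified patterns that gradient vanishes and nothing changes.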

Although this loss function is non differentiable when $w'x$ is null, it meets
condition (12) as soon as the expectation $E(x)$ is defined. We can therefore ignore
the non differentiability and apply the online gradient descent algorithm:

$w_{t+1} = w_t - \gamma_t (\mathrm{sign}(w_t' x_t) - y_t)\, x_t$   (15)

Since the desired class is either +1 or -1, the weights are not modified when
the pattern $x$ is correctly classified. Therefore this parameter update (15) is
equivalent to the perceptron rule (13).

The perceptron loss function (14) is zero when the pattern $x$ is correctly
recognized as a member of class $y = \pm 1$. Otherwise its value is positive and
proportional to the dot product $w'x$. The corresponding cost function reaches
its minimal value zero when all examples are properly recognized or when the
weight vector $w$ is null.

Such hinge loss functions [17, 36] have recently drawn much interest because
of their links with the Support Vector Machines and the AdaBoost algorithm.

K-Means. The K-Means algorithm [28] is a popular clustering method which
dispatches $K$ centroids $w(k)$ in order to find clusters in a set of points $x_1, \ldots, x_L$.
This algorithm can be derived by performing the online gradient descent with
the following loss function.

$Q_{kmeans}(x, w) = \min_{k=1}^{K} (x - w(k))^2$   (16)

This loss function measures the quantization error, that is to say the
error on the position of point $x$ when we replace it by the closest centroid. The
corresponding cost function measures the average quantization error.

Fig. 4. K-Means dispatches a predefined number of cluster centroids in a way that
minimizes the quantization error

This loss function is not differentiable on points located on the Voronoi
boundaries of the set of centroids, but meets condition (12) as soon as the
expectations $E(x)$ and $E(x^2)$ are defined. On the remaining points, the derivative
of the loss is the derivative of the distance to the nearest centroid $w^-$. We can
therefore ignore the non-differentiable points and apply the online gradient
descent algorithm.

$w^-_{t+1} = w^-_t + \gamma_t (x_t - w^-_t)$   (17)

This formula describes an elementary iteration of the K-Means algorithm. A
very efficient choice of learning rates $\gamma_t$ will be suggested in Section 4.6.
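
The elementary iteration (17) can be sketched as below. The learning rate used here, one over the number of points already assigned to the centroid being moved, is our own assumption for the example and is not taken from the text. The data (three Gaussian blobs) and the function names are also hypothetical.

```python
import numpy as np

def online_kmeans(X, K, n_iter, seed=0):
    rng = np.random.default_rng(seed)
    # Start from K distinct data points.
    w = X[rng.choice(len(X), size=K, replace=False)].astype(float)
    counts = np.zeros(K)
    for _ in range(n_iter):
        x = X[rng.integers(len(X))]                  # draw one point at random
        k = np.argmin(((w - x) ** 2).sum(axis=1))    # closest centroid w^-
        counts[k] += 1
        gamma = 1.0 / counts[k]                      # assumed learning rate
        w[k] += gamma * (x - w[k])                   # update (17)
    return w

def quantization_error(X, w):
    """Average squared distance to the closest centroid (empirical cost)."""
    d2 = ((X[:, None, :] - w[None, :, :]) ** 2).sum(axis=2)
    return d2.min(axis=1).mean()

rng = np.random.default_rng(1)
centers = np.array([[0.0, 0.0], [5.0, 5.0], [0.0, 5.0]])
X = np.vstack([c + 0.5 * rng.standard_normal((100, 2)) for c in centers])
w = online_kmeans(X, K=3, n_iter=3000)
err = quantization_error(X, w)
```

Only the closest centroid moves at each step, since the loss (16) does not depend on the other centroids away from the Voronoi boundaries.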




Fig. 5. Kohonen’s LVQ2 pattern recognition scheme outputs the class associated with 
the closest reference point to the input pattern 



Learning Vector Quantization 2. Kohonen's Learning Vector Quantization 2
(LVQ2) rule [20] is a powerful pattern recognition algorithm. Like K-Means,
it uses a fixed set of reference points $w(k)$. A class $y(k)$ is associated with each
reference point. As shown in Figure 5, an unknown pattern $x$ is then recognized
as a member of the class associated with the nearest reference point.

Given a training pattern $x$, let us denote $w^-$ the nearest reference point and
denote $w^+$ the nearest reference point among those associated with the correct
class $y$. Adaptation only occurs when the closest reference point $w^-$ is associated
with an incorrect class while the closest correct reference point $w^+$ is not too far
away:

if $x$ is misclassified ($w^- \ne w^+$) and $(x - w^+)^2 < (1 + \delta)(x - w^-)^2$

then $w^-_{t+1} = w^-_t - \varepsilon_t (x - w^-_t)$ and $w^+_{t+1} = w^+_t + \varepsilon_t (x - w^+_t)$   (18)

Reference points are only updated when the pattern $x$ is misclassified.
Furthermore, the distance to the closest correct reference point $w^+$ must not exceed
the distance to the closest (incorrect) reference point $w^-$ by more than a
percentage defined by parameter $\delta$. When both conditions are met, the algorithm
pushes the closest (incorrect) reference point $w^-$ away from the pattern $x$, and
pulls the closest correct reference point $w^+$ towards the pattern $x$.

This intuitive algorithm can be derived by performing an online gradient 
descent with the following loss function: 



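$$ Q_{\mathrm{lvq2}}(z, w) = \begin{cases} 0 & \text{if } x \text{ is well classified } (w^+ = w^-) \\ 1 & \text{if } (x - w^+)^2 \geq (1+\delta)\,(x - w^-)^2 \\ \dfrac{(x - w^+)^2 - (x - w^-)^2}{\delta\,(x - w^-)^2} & \text{otherwise} \end{cases} \tag{19} $$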



This function is a continuous approximation to a binary variable indicating 
whether pattern x is misclassified. The corresponding cost function therefore is a 
continuous approximation of the system misclassification rate [9]. This analysis
helps to understand how the LVQ2 algorithm works.

Although the above loss function is not differentiable for some values of $w$, it
meets condition (12) as soon as the expectations $E(x)$ and $E(x^2)$ are defined. We
can therefore ignore the non-differentiable points and apply the online gradient
descent algorithm:



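$$ \text{if } \begin{cases} x \text{ is misclassified } (w^- \neq w^+) \\ \text{and } (x - w^+)^2 < (1+\delta)\,(x - w^-)^2 \end{cases} \text{ then } \begin{cases} w^-_{t+1} = w^-_t - \gamma_t k_1\,(x - w^-_t) \\ w^+_{t+1} = w^+_t + \gamma_t k_2\,(x - w^+_t) \end{cases} \tag{20} $$

$$ \text{with} \quad k_2 = \frac{1}{\delta\,(x - w^-)^2} \quad \text{and} \quad k_1 = k_2\,\frac{(x - w^+)^2}{(x - w^-)^2} \tag{21} $$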



This online gradient descent algorithm (20) is similar to the usual LVQ2
learning algorithm (18). The difference between the two scalar coefficients $k_1$
and $k_2$ can be viewed as a minor variation on the learning rates.
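
The two coefficients can be checked by differentiating the third branch of the loss (19). The following worked derivation is a sketch under two assumptions made here: $(x - w)^2$ is read as a squared Euclidean norm, and the constant factor 2 produced by the differentiation is absorbed into the learning rate.

```latex
\begin{aligned}
Q &= \frac{(x-w^+)^2 - (x-w^-)^2}{\delta\,(x-w^-)^2}
   = \frac{1}{\delta}\left(\frac{(x-w^+)^2}{(x-w^-)^2} - 1\right) \\[4pt]
\nabla_{w^+} Q &= -\frac{2}{\delta}\,\frac{1}{(x-w^-)^2}\,(x-w^+)
   = -2\,k_2\,(x-w^+),
   \qquad k_2 = \frac{1}{\delta\,(x-w^-)^2} \\[4pt]
\nabla_{w^-} Q &= +\frac{2}{\delta}\,\frac{(x-w^+)^2}{(x-w^-)^4}\,(x-w^-)
   = +2\,k_1\,(x-w^-),
   \qquad k_1 = \frac{1}{\delta}\,\frac{(x-w^+)^2}{(x-w^-)^4}
\end{aligned}
```

A gradient step $w \leftarrow w - \gamma_t \nabla_w Q$ therefore moves $w^+$ by $+2\gamma_t k_2 (x - w^+)$, towards the pattern, and moves $w^-$ by $-2\gamma_t k_1 (x - w^-)$, away from it.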



3 Convergence 

Given a suitable choice of the learning rates $\gamma_t$, the batch gradient descent
algorithm (3) is known to converge to a local minimum of the cost function. This
local minimum is a function of the initial parameters $w_0$. The parameter
trajectory follows the meanders of the local attraction basin and eventually
reaches the corresponding minimum.

The random noise introduced by stochastic gradient descent (4) disrupts this
deterministic picture. The parameter trajectory can jump from basin to basin.
One usually distinguishes a search phase that explores the parameter space and a
final phase that takes place in the vicinity of a minimum.

- The final phase takes place in the vicinity of a single local minimum $w^*$ where
  the cost function is essentially convex. This is discussed in Section 3.1.

- Our understanding of the search phase is still very spotty. Section 3.2 presents
  sufficient conditions to guarantee that the convergence process will
  eventually reach the final phase.



3.1 Final Convergence Phase 

The following discussion relies on the general convexity assumption. Everywhere
in the parameter space, the opposite of the gradient must point toward a unique
minimum $w^*$.

$$ \forall \varepsilon > 0, \quad \inf_{(w - w^*)^2 > \varepsilon} (w - w^*)\,\nabla_w C(w) > 0 \tag{22} $$

Such a strong assumption is only valid for a few simple learning algorithms
such as the Adaline (Section 2.3). Nevertheless these results are useful for
understanding the final convergence phase. The assumption usually holds within
the final convergence region because the cost function is locally convex.

The parameter updates $\gamma_t \nabla_w Q(z, w)$ must become smaller and smaller when
the parameter vector $w$ approaches the optimum $w^*$. This implies that either
the gradients or the learning rates must vanish in the vicinity of the optimum.

More specifically one can write:

$$ E_z\left[ \nabla_w Q(z, w)^2 \right] = E_z\left[ (\nabla_w Q(z, w) - \nabla_w C(w))^2 \right] + \nabla_w C(w)^2 $$

The first term is the variance of the stochastic gradient. It is reasonable to
assume that it does not grow faster than the norm of the real gradient itself. In
the vicinity of $w^*$ we can write:

$$ \|\nabla_w C(w)\|^2 = \|\nabla_w C(w) - \nabla_w C(w^*)\|^2 \lesssim \|\nabla^2_w C(w^*)\|^2 \, \|w - w^*\|^2 $$

It is therefore reasonable to assume that $\|\nabla_w C(w)\|^2$ behaves quadratically
within the final convergence region. Both assumptions are conveniently expressed
as follows:

$$ E_z\left[ \nabla_w Q(z, w)^2 \right] \leq A + B\,(w - w^*)^2 \quad \text{with } A \geq 0,\; B \geq 0 \tag{23} $$

The constant $A$ must be greater than the residual variance $E_z[\nabla_w Q(z, w^*)^2]$
of the gradients at the optimum. This residual variance can be zero for certain
rare noiseless problems where $w^*$ simultaneously optimizes the loss for every
example. It is strictly positive in most practical cases. The average norm of
the gradients then does not vanish when the parameter vector $w$ approaches the
optimum. Therefore one must use decreasing learning rates, e.g.:

$$ \sum_t \gamma_t^2 < \infty \tag{24} $$

Footnote (general convexity assumption (22)): The optimization literature often
defines such extended notions of convexity. Small details are important. For
instance, in (22), one cannot simply replace the infimum by $\forall w \neq w^*$.
Consider function $C(w) = 1 - \exp(-\|w\|^2)$ as a counter-example.

The presence of constant $A$ in (23) marks a critical difference between
stochastic and ordinary gradient descent. There is no such constant in the case
of the ordinary gradient descent. A simple analysis then yields an expression for
the maximal constant learning rate [25]. In the stochastic gradient case, this
analysis suggests that the parameter vector eventually hovers around the minimum
$w^*$ at a distance roughly proportional to $\gamma_t$. Quickly decreasing the learning
rate is therefore tempting. Suppose however that the learning rates decrease so
fast that $\sum_t \gamma_t = R < \infty$. This would effectively maintain the parameters
within a certain radius of their initial value. It is therefore necessary to
enforce the following condition:

$$ \sum_t \gamma_t = \infty \tag{25} $$
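
A quick numerical check makes the two learning rate conditions concrete. The sketch below is illustrative only; the helper names `partial_sums`, `inverse_t` and `inverse_t_squared` are assumptions introduced here. It compares the schedule $1/t$, whose sum diverges while its sum of squares converges, with the schedule $1/t^2$, whose sum stays bounded.

```python
def partial_sums(rate, steps):
    """Return (sum of rates, sum of squared rates) for t = 1..steps."""
    total, total_sq = 0.0, 0.0
    for t in range(1, steps + 1):
        g = rate(t)
        total += g
        total_sq += g * g
    return total, total_sq


def inverse_t(t):
    """Schedule 1/t: the sum grows like log(t), the sum of squares is bounded."""
    return 1.0 / t


def inverse_t_squared(t):
    """Schedule 1/t^2: decreases too fast, the sum itself is bounded."""
    return 1.0 / (t * t)


for name, rate in (("1/t", inverse_t), ("1/t^2", inverse_t_squared)):
    for steps in (1000, 100000):
        s, s2 = partial_sums(rate, steps)
        print(f"{name:6s} steps={steps:6d}  sum={s:9.4f}  sum_sq={s2:9.6f}")
```

With the second schedule the total displacement available to the parameters is bounded, which is the situation the divergence condition rules out.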



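Convex Convergence Theorem. The general convexity (22) and the three conditions
(23), (24) and (25) are sufficient conditions for the almost sure convergence of
the stochastic gradient descent (4) to the optimum $w^*$.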

The following discussion provides a sketch of the proof. This proof is simply 
an extension of the convergence proofs for the continuous gradient descent and 
the discrete gradient descent. 

The continuous gradient descent proof studies the convergence of the function 
w{t) defined by the following differential equation: 

$$ \frac{dw}{dt} = -\nabla_w C(w) $$

This proof follows three steps: 



A) Define a Lyapunov criterion. A Lyapunov function is a function
whose convergence to zero implies the convergence of $w(t)$ to $w^*$ when $t$ grows
to infinity. For the continuous gradient we simply use $h(t) = (w(t) - w^*)^2$.

B) The Lyapunov criterion converges. Using (22), it is easy to see that
$dh/dt = -2\,(w - w^*)\,\nabla_w C(w) \leq 0$. Function $h(t)$ converges because it is
both positive and decreasing.

C) The Lyapunov criterion converges to zero. We know that $dh/dt \to 0$
because $h(t)$ converges. Assumption (22) then implies that $(w - w^*)^2 \to 0$.
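
The three steps can be observed numerically on a toy problem. The sketch below is an illustration under stated assumptions: the cost is a convex quadratic $C(w) = \frac{1}{2}\sum_i a_i w_i^2$ with minimum $w^* = 0$, and the differential equation is approximated by small Euler steps of size `dt`. The names `gradient` and `lyapunov_trace` are introduced here for the example.

```python
def gradient(w, curv):
    """Gradient of C(w) = 0.5 * sum(a_i * w_i^2) for curvatures a_i > 0."""
    return [a * wi for a, wi in zip(curv, w)]


def lyapunov_trace(w0, curv, dt=0.01, steps=500):
    """Follow dw/dt = -grad C(w) with Euler steps and record h = |w - w*|^2.

    The minimum is w* = 0 for this quadratic cost, so h is simply |w|^2.
    """
    w = list(w0)
    trace = [sum(wi * wi for wi in w)]
    for _ in range(steps):
        g = gradient(w, curv)
        w = [wi - dt * gi for wi, gi in zip(w, g)]
        trace.append(sum(wi * wi for wi in w))
    return trace


trace = lyapunov_trace([1.0, 1.0], curv=[1.0, 4.0])
print("h at start:", trace[0], " h at end:", trace[-1])
```

The recorded criterion is positive and decreases at every step, which is the behavior used in step B, and it approaches zero as required by step C.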

The convergence proofs for both the discrete (3) and stochastic (4) gradient 
descents follow the same three steps. Each step requires increasingly sophisti- 
cated mathematical tools summarized in the following table. 
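                       | Continuous              | Discrete                | Stochastic
-----------------------+-------------------------+-------------------------+------------------------
Step A                 | Function                | Sequence                | Random process
Define Lyapunov        | $h(t) = (w(t) - w^*)^2$ | $h_t = (w_t - w^*)^2$   | $h_t = (w_t - w^*)^2$
criterion.             |                         |                         |
-----------------------+-------------------------+-------------------------+------------------------
Step B                 | Decreasing positive     | Positive sequence       | Positive
Lyapunov criterion     | function                | with bounded            | quasi-martingales
converges.             |                         | positive increments     |
-----------------------+-------------------------+-------------------------+------------------------
Step C                 | General Convexity (all three cases)
Lyapunov criterion     |
converges to zero.     |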

Full details can be found in [9, 10]. 

3.2 Search Phase 

This section discusses the convergence of the stochastic gradient algorithm (4) 
without the general convexity assumption (22). Since the cost function C{w) 
can have several local minima, this discussion encompasses the search phase. 
Although our understanding of the search phase is still very incomplete, empir- 
ical and theoretical evidence indicate that stochastic gradient algorithms enjoy 
significant advantages over batch algorithms. Stochastic gradient descent benefits
from the redundancies of the training set. Consider the extreme case where a
training set of size 1000 is inadvertently composed of 10 identical copies of a 
set with 100 samples. Averaging the gradient over all 1000 patterns gives the 
exact same result as computing the gradient based on just the first 100. Batch 
gradient descent is wasteful because it re-computes the same quantity 10 times 
before one parameter update. On the other hand, stochastic gradient will see a 
full epoch as 10 iterations through a 100-long training set. 
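
The redundancy argument is easy to reproduce. The following sketch is only an illustration: the scalar linear model, the synthetic data ($y \approx 2x$ with small uniform noise), the learning rate 0.05 and the helper names are all assumptions made here. It checks that the averaged gradient is the same on the 1000 patterns as on the first 100, and compares one batch update with one stochastic sweep over the same data.

```python
import random


def grad(w, x, y):
    """Gradient in w of the squared loss (y - w * x) ** 2, scalar case."""
    return -2.0 * (y - w * x) * x


def mean_grad(w, data):
    """Batch gradient: average of the per-example gradients."""
    return sum(grad(w, x, y) for x, y in data) / len(data)


def sgd_pass(w, data, rate):
    """One sweep of stochastic gradient descent, one update per example."""
    for x, y in data:
        w -= rate * grad(w, x, y)
    return w


rng = random.Random(0)
xs = [rng.uniform(-1.0, 1.0) for _ in range(100)]
base = [(x, 2.0 * x + rng.uniform(-0.1, 0.1)) for x in xs]
full = base * 10          # 1000 patterns: 10 identical copies of `base`

batch_w = 0.0 - 0.05 * mean_grad(0.0, full)   # one batch update
online_w = sgd_pass(0.0, full, 0.05)          # 1000 online updates

print("gradient gap:", abs(mean_grad(0.3, full) - mean_grad(0.3, base)))
print("after one batch step:", batch_w, " after one online sweep:", online_w)
```

For the same number of gradient evaluations, the batch method has moved once while the stochastic method has effectively completed ten passes over the distinct examples.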

In the case of the continuous and discrete gradient descent, it is usually 
sufficient to partition the parameter space into several attraction basins, discuss 
the conditions under which the algorithm confines the parameters $w_t$ in a single
attraction basin, define a suitable Lyapunov criterion [21], and proceed as in the 
convex case. This approach does not work well with stochastic gradient because 
the parameter trajectory can always jump from basin to basin. 

Let us instead assume that the cost function becomes large when one wanders 
far from the origin. The global landscape then looks like a single large attraction 
basin. The local minima structure only shows when one takes a closer look at
the vicinity of the apparent minimum.

This situation can be expressed by the following assumptions: 

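i.) $\inf_w C(w) > -\infty$

ii.) $\exists D > 0, \quad \inf_{w^2 > D} \; w\,\nabla_w C(w) > 0$

iii.) $E_z\left[ (\nabla_w Q(z, w))^2 \right] \leq A + B\,w^2$

iv.) $\exists E > D, \; \forall z, \quad \sup_{w^2 < E} \|\nabla_w Q(z, w)\| \leq \text{Constant}$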
Assumption (i) indicates that the cost is bounded from below. Assumption
(ii) indicates that the gradient far away from the origin always drives us back
towards the origin. Assumptions (iii) and (iv) limit the variance of the stochastic
gradient and the asymptotic growth of the real gradients.



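Global Confinement Theorem: The four assumptions (i) to (iv) above, together with
the two learning rate conditions (24) and (25), guarantee that the parameter $w_t$
of the stochastic gradient descent almost surely remains confined within
distance $\sqrt{E}$ of the origin.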



This global confinement result [10] is obtained using the same proof
technique as in the convex case. The Lyapunov criterion is simply defined as
$h_t = \max(E, w_t^2)$.

Global confinement shows that $w_t$ evolves in a compact domain where nothing
dramatic can happen. In fact, it even implies that the stochastic gradient descent
will sooner or later reach the final convergence phase. This is formalized by the
following result:

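Gradient Convergence Theorem. The four assumptions (i) to (iv) above, together
with the two learning rate conditions (24) and (25), guarantee that the gradient
$\nabla_w C(w_t)$ converges to zero almost surely.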

The proof of this final convergence result [10] again is very similar to the con- 
vergence proofs for the convex case with suitable choices for the Lyapunov crite- 
rion. The details of the proof extensively rely on the global confinement result. 



4 Convergence Speed and Learning Speed 

The main purpose of this section is to illustrate a critical difference between
optimization algorithms and learning algorithms. It will then appear that
stochastic gradient descent is simultaneously a very poor optimization algorithm
and a very effective learning algorithm.

4.1 Convergence Speed for Batch Gradient Descent 

Simple batch gradient descent enjoys linear convergence speed (see for instance
Section 5 of [25]). The convergence speed of batch gradient descent drastically
improves when one replaces the scalar learning rates $\gamma_t$ by a positive
definite symmetric matrix $\Phi_t$ that approximates the inverse of the Hessian
of the cost function.

$$ H(w) = \nabla^2_w C(w) \tag{26} $$

This leads to very effective optimization algorithms such as Newton's
algorithm, Levenberg-Marquardt, Conjugate Gradient and BFGS (see [15] for a
review). These algorithms achieve superlinear or even quadratic convergence
speeds.

Footnote (assumptions (iii) and (iv) of Section 3.2): See also the discussion
for convex assumption (23).

Footnote (linear convergence speed): $\log\bigl(1/(w_t - w^*)^2\bigr)$ grows linearly
with $t$.



4.2 Convergence Speed for Stochastic Algorithms 

Whereas online algorithms may converge to the general area of the optimum at 
least as fast as batch algorithms [25], the optimization proceeds rather slowly 
during the final convergence phase [14]. The noisy gradient estimate causes the 
parameter vector to fluctuate around the optimum in a bowl whose size depends 
on the decreasing learning rates and is therefore constrained by (25). It can be
shown that this size decreases like $1/t$ at best.

Stochastic gradient descent nevertheless benefits from using similar second
order methods. The gradient vector is rescaled using a positive symmetric matrix
$\Phi_t$ that approximates the inverse Hessian (26) in a manner analogous to
Newton's algorithm. The same convergence results apply as long as the
eigenvalues of the scaling matrix $\Phi_t$ are bounded.

$$ w_{t+1} = w_t - \gamma_t\,\Phi_t\,\nabla_w Q(z_t, w_t) \tag{27} $$



For simplicity, this section only addresses the case $\gamma_t = 1/t$ which satisfies
both conditions (24) and (25). It is however important to understand that the
second order stochastic algorithm (27) still experiences the stochastic noise
resulting from the random selection of the examples $z_t$. Its convergence speed
still depends on the choice of decreasing learning rates $\gamma_t$ and is therefore
constrained by condition (25). This is a sharp contrast with the case of batch
algorithms where the same scaling matrix yields superlinear convergence.

Stochastic gradient descent is a hopeless optimization algorithm. It is tempt- 
ing to conclude that it is also a very poor learning algorithm. Yet experience 
suggests otherwise [4]. 



4.3 Optimization Versus Learning 

This apparent contradiction is resolved when one considers that the above
discussion compares the speed of two different convergences:

- Batch algorithms converge towards a minimum of the empirical risk $C_L(w)$,
  which is defined as an average on $L$ training examples (2).

- Stochastic algorithms converge towards a minimum of the expected risk $C(w)$,
  which is defined as an expectation with respect to the probability
  distribution from which we draw the examples (1).

Footnote (quadratic convergence speed, Section 4.1):
$\log\log\bigl(1/(w_t - w^*)^2\bigr)$ grows linearly with $t$.

Footnote (convergence speed of stochastic gradient, Section 4.2):
$1/(w_t - w^*)^2$ grows linearly with $t$.

Footnote (second order stochastic gradient, Section 4.2): Such second order
stochastic approximations are standard practice in the Stochastic Approximation
literature [22, 27, 7].
In a learning problem, we are interested in knowing the speed of convergence
towards the minimum of the expected risk $C(w)$ because it reflects the
generalization error. Replacing the expected risk $C(w)$ by the empirical risk
$C_L(w)$ is by itself an approximation. As shown in the next section, this
approximation spoils the potential benefits of running an optimization algorithm
with ambitious convergence speed.



4.4 Optimizing the Empirical Risk Is a Stochastic Process 

We consider in this section an infinite sequence of independent training examples
$(z_1, \dots, z_t, \dots)$. Let $w^*_t$ be the minimum of the empirical risk $C_t(w)$
defined on a training set composed of the first $t$ examples $(z_1, \dots, z_t)$. We
assume that all the $w^*_t$ are located in the vicinity of the minimum $w^*$ of the
expected risk $C(w)$.

Manipulating a Taylor expansion of the gradient of $C_{t+1}(w)$ in the vicinity
of $w^*_t$ provides the following recursive relation:

$$ w^*_{t+1} = w^*_t - \frac{1}{t+1}\,\Psi_t\,\nabla_w Q(z_{t+1}, w^*_t) + O\!\left(\frac{1}{t^2}\right) \tag{28} $$

with

$$ \Psi_t = \left( \frac{1}{t+1} \sum_{i=1}^{t+1} \nabla^2_w Q(z_i, w^*_t) \right)^{-1} \xrightarrow[\;t \to \infty\;]{} H^{-1}(w^*) $$

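The similarity between (28) and (27) suggests that both the batch sequence
$(w^*_t)$ and online sequence $(w_t)$ converge at the same speed for adequate choices
of the scaling matrix $\Phi_t$. Theoretical analysis indeed shows that [31, 13]:

$$ E\left[ (w^*_t - w^*)^2 \right] = \frac{K}{t} + o\!\left(\frac{1}{t}\right) \tag{29} $$

$$ \Phi_t \to H^{-1}(w^*) \;\Longrightarrow\; E\left[ (w_t - w^*)^2 \right] = \frac{K}{t} + o\!\left(\frac{1}{t}\right) \tag{30} $$

where

$$ K = \mathrm{trace}\left( H^{-1}(w^*) \cdot E_z\left[ (\nabla_w Q(z, w^*))\,(\nabla_w Q(z, w^*))' \right] \cdot H^{-1}(w^*) \right) $$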

Not only does this result establish that both sequences have $O(1/t)$
convergence, but it also provides the value of the common constant $K$. This
constant is neither affected by the second order terms of (28) nor by the
convergence speed of the scaling matrix $\Phi_t$ towards the inverse Hessian [13].

Following [40], we say that a second order stochastic algorithm is optimal
when $\Phi_t$ converges to $H^{-1}(w^*)$. Figure 6 summarizes the behavior of such
optimal algorithms. After $t$ iterations on fresh examples, the point $w_t$ reached
by an optimal stochastic learning algorithm is asymptotically as good as the
solution $w^*_t$ of a batch algorithm trained on the same $t$ examples.
[Figure: $w^*$ is the true solution; $E[(w_t - w^*)^2] \sim E[(w^*_t - w^*)^2] \sim K/t$]

Fig. 6. After $t$ iterations on fresh examples, the point $w_t$ reached by an optimal
stochastic learning algorithm is asymptotically as good as the solution $w^*_t$ of a
batch algorithm trained on the same $t$ examples
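
A one-dimensional case shows this equivalence exactly. The sketch below is an illustration under assumptions chosen here: the loss is $Q(z, w) = \frac{1}{2}(z - w)^2$, whose Hessian is 1, so the optimal scaling is $\Phi_t = 1$; the learning rate is $1/t$; the function names are hypothetical. Under these assumptions the empirical optimum on the first $t$ examples is their mean, and the stochastic iterate coincides with it.

```python
import random


def online_mean(samples, w0=0.0):
    """Stochastic gradient descent on Q(z, w) = (z - w)^2 / 2.

    Uses learning rate 1/t and scaling Phi = 1, the inverse Hessian here.
    """
    w = w0
    for t, z in enumerate(samples, start=1):
        w -= (1.0 / t) * (w - z)   # gradient of Q in w is (w - z)
    return w


def batch_mean(samples):
    """Empirical optimum: minimizes the average of Q over all samples."""
    return sum(samples) / len(samples)


rng = random.Random(1)
zs = [rng.gauss(3.0, 2.0) for _ in range(1000)]
print("online:", online_mean(zs), " batch:", batch_mean(zs))
```

The starting point `w0` has no influence because the first update uses a learning rate of 1. In this example the constant $K$ reduces to the variance of $z$, and the squared error of the sample mean is the familiar variance divided by $t$.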

4.5 Comparing Computational Complexities 

The discussion so far has established that a properly designed online learning 
algorithm performs as well as any batch learning algorithm for the same number
of examples. We now establish that, given the same computing resources, a 
stochastic learning algorithm can asymptotically process more examples than a 
batch learning algorithm. 

Each iteration of a batch learning algorithm running on $N$ training examples
requires a time $K_1 N + K_2$. Constants $K_1$ and $K_2$ respectively represent the
time required to process each example, and the time required to update the
parameters. Result (29) provides the following asymptotic equivalence:

$$ E\left[ (w^*_N - w^*)^2 \right] \sim \frac{1}{N} $$

The batch algorithm must perform enough iterations to approach the empirical
optimum $w^*_N$ with at least the same accuracy ($\sim 1/N$). A very efficient
algorithm with quadratic convergence achieves this after a number of iterations
asymptotically proportional to $\log\log N$.

Running a stochastic learning algorithm requires a constant time $K_3$ per
processed example. Let us call $T$ the number of examples processed by the
stochastic learning algorithm using the same computing resources as the batch
algorithm. We then have:

$$ K_3 T \sim (K_1 N + K_2)\,\log\log N \;\Longrightarrow\; T \sim N \log\log N $$

The parameter $w_T$ of the stochastic algorithm also converges according to
(30). Comparing the accuracies of both algorithms shows that the stochastic
algorithm asymptotically provides a better solution by a factor $\sim \log\log N$.

$$ \frac{1}{\log\log N}\, E\left[ (w^*_N - w^*)^2 \right] \sim \frac{1}{N \log\log N} \sim \frac{1}{T} \sim E\left[ (w_T - w^*)^2 \right] \tag{31} $$

This $\log\log N$ factor corresponds to the number of iterations required by
the batch algorithm. This number increases slowly with the desired accuracy
of the solution. In practice, this factor is much less significant than the actual
value of the constants $K_1$, $K_2$ and $K_3$. Experience shows however that stochastic
algorithms are considerably easier to implement. Each iteration of the batch 
algorithm involves a large summation over all the available examples. Memory 
must be allocated to hold these examples. On the other hand, each iteration of 
the stochastic algorithm only involves one random example which can then be 
discarded. 
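
To get a feeling for how slowly the $\log\log N$ factor grows, the time budget argument can be evaluated directly. The sketch below is illustrative: the function name, the default unit constants, and the use of natural logarithms are assumptions made here, and the asymptotic relation is treated as an equality.

```python
import math


def examples_in_same_time(n, k1=1.0, k2=0.0, k3=1.0):
    """Examples T an online learner can process in the batch time budget.

    Assumes a quadratically convergent batch method needs about
    log(log(n)) iterations, each costing k1 * n + k2, and that the online
    learner spends k3 per example. Only meaningful for large n.
    """
    iterations = math.log(math.log(n))
    return (k1 * n + k2) * iterations / k3


for n in (10**4, 10**6, 10**9):
    ratio = examples_in_same_time(n) / n
    print(f"N = {n:>10d}   T / N = {ratio:.3f}")
```

Even for a billion examples the factor stays close to 3, which is why the constants usually matter more than the asymptotic advantage.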

4.6 Examples 

Optimal Learning Rate for K-Means. Second derivative information can
be used to determine very efficient learning rates for the K-Means algorithm
(Section 2.4). A simple analysis of the loss function (16) shows that the Hessian
of the cost function is a diagonal matrix [11] whose coefficients $\lambda(k)$ are equal
to the probabilities that an example $x$ is associated with the corresponding
centroid $w(k)$.

These probabilities can be estimated by simply counting how many examples
$n(k)$ have been associated with each centroid $w(k)$. Each iteration of the
corresponding stochastic algorithm consists of drawing a random example $x_t$,
finding the closest centroid $w(k)$, and updating both the count and the centroid
with the following equations:

$$ \begin{cases} n_{t+1}(k) = n_t(k) + 1 \\ w_{t+1}(k) = w_t(k) + \dfrac{1}{n_{t+1}(k)}\,(x_t - w_t(k)) \end{cases} \tag{32} $$

Algorithm (32) very quickly locates the relative position of clusters in the
data. Terminal convergence however is slowed down by the noise implied by the
random choice of the examples. Experimental evidence [11] suggests that the best
optimization speed is achieved by first using the stochastic algorithm (32) and
then switching to a batch super-linear version of K-Means.
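
The count-based update can be sketched as follows. This is a minimal illustration; the function name `kmeans_online_step`, the list-based data layout, and the choice of starting all counts at zero are assumptions made here.

```python
def sq_dist(a, b):
    """Squared Euclidean distance between two equal-length sequences."""
    return sum((ai - bi) ** 2 for ai, bi in zip(a, b))


def kmeans_online_step(centroids, counts, x):
    """Process one example: update the closest centroid and its count.

    centroids : list of centroid vectors, modified in place
    counts    : number of examples assigned to each centroid so far
    Returns the index of the centroid that was updated.
    """
    k = min(range(len(centroids)), key=lambda j: sq_dist(x, centroids[j]))
    counts[k] += 1
    rate = 1.0 / counts[k]   # learning rate derived from the count
    centroids[k] = [w + rate * (xi - w) for w, xi in zip(centroids[k], x)]
    return k


centroids = [[0.0], [10.0]]
counts = [0, 0]
for x in ([1.0], [3.0], [9.0], [11.0]):
    kmeans_online_step(centroids, counts, x)
print(centroids, counts)
```

With counts starting at zero, each centroid is at all times the mean of the examples that have been assigned to it, so the initial centroid positions only influence the first assignments.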

Kalman Algorithms. The Kalman filter theory has introduced an efficient 
way to compute an approximation of the inverse of the Hessian of certain cost 
functions. This idea is easily demonstrated in the case of linear algorithms such 
as the Adaline (Section 2.3). Consider stochastic gradient descent applied to the 
minimization of the following mean square cost function: 

$$ C(w) = \int Q(z, w)\,dP(z) \quad \text{with} \quad Q(z, w) = (y - w'x)^2 \tag{33} $$

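Each iteration of this algorithm consists of drawing a new pair $z_t = (x_t, y_t)$
from the distribution $dP(z)$ and applying a parameter update formula similar
to (27):

$$ w_{t+1} = w_t - \tfrac{1}{2}\,H_t^{-1}\,\nabla_w Q(z_t, w_t) = w_t + H_t^{-1}\,(y_t - w_t' x_t)\,x_t \tag{34} $$

where $H_t$ denotes the Hessian of an empirical estimate $C_t(w)$ of the cost function
$C(w)$ based on the examples $z_1, \dots, z_t$ observed so far. The sign and the factor
$1/2$ follow from $\nabla_w Q(z, w) = -2\,(y - w'x)\,x$ and from the $1/2$ normalization
of $C_t(w)$ in (35).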
$$ C_t(w) = \frac{1}{2} \sum_{i=1}^{t} (y_i - w'x_i)^2 \tag{35} $$

$$ H_t = \nabla^2_w C_t(w) = \sum_{i=1}^{t} x_i x_i' \tag{36} $$



Directly computing the matrix $H_t^{-1}$ at each iteration would be very
expensive. We can take advantage however of the recursion $H_t = H_{t-1} + x_t x_t'$
using the well known matrix equality:

$$ (A + BB')^{-1} = A^{-1} - (A^{-1}B)\,(I + B'A^{-1}B)^{-1}\,(A^{-1}B)' \tag{37} $$



Algorithm (34) then can be rewritten recursively using the Kalman matrix
$K_t = H_{t-1}^{-1}$. The resulting algorithm (38) converges much faster than the delta
rule (6) and yet remains quite easy to implement:

$$ \begin{cases} K_{t+1} = K_t - \dfrac{(K_t x_t)(K_t x_t)'}{1 + x_t' K_t x_t} \\ w_{t+1} = w_t + K_{t+1}\,(y_t - w_t' x_t)\,x_t \end{cases} \tag{38} $$
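
The recursion is short enough to write out in plain Python. The sketch below is illustrative and rests on assumptions made here: the helper names are hypothetical, the residual is written as $y_t - w_t'x_t$ with a positive sign in the parameter update, and the Kalman matrix is started at a large multiple of the identity, which amounts to a small regularization of the Hessian since $H_0 = 0$ is not invertible.

```python
def mat_vec(m, v):
    """Product of a matrix (list of rows) with a vector."""
    return [sum(mij * vj for mij, vj in zip(row, v)) for row in m]


def dot(a, b):
    """Inner product of two vectors."""
    return sum(ai * bi for ai, bi in zip(a, b))


def kalman_step(k_mat, w, x, y):
    """One iteration of the recursive update; returns the new (K, w)."""
    kx = mat_vec(k_mat, x)                    # K_t x_t
    denom = 1.0 + dot(x, kx)                  # 1 + x_t' K_t x_t
    k_new = [[kij - kx[i] * kx[j] / denom for j, kij in enumerate(row)]
             for i, row in enumerate(k_mat)]  # rank-one downdate of K
    err = y - dot(w, x)                       # residual y_t - w_t' x_t
    gain = mat_vec(k_new, x)                  # K_{t+1} x_t
    w_new = [wi + err * gi for wi, gi in zip(w, gain)]
    return k_new, w_new


# Noise-free data generated by w = (2, -1); K starts at 1000 * identity.
k_mat = [[1000.0, 0.0], [0.0, 1000.0]]
w = [0.0, 0.0]
for x in ([1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [1.0, -1.0], [2.0, 1.0]):
    y = 2.0 * x[0] - 1.0 * x[1]
    k_mat, w = kalman_step(k_mat, w, x, y)
print(w)
```

No matrix inversion appears in the loop: each iteration costs a few matrix-vector products, which is the computational gain obtained from the matrix equality.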



Gauss Newton Algorithms. Non linear least mean square algorithms, such as
the multi-layer networks (Section 2.3), can also benefit from non-scalar learning
rates. The idea consists of using an approximation of the Hessian matrix. The
second derivatives of the loss function (9) can be written as:

$$ \frac{1}{2}\,\nabla^2_w (y - f(x, w))^2 = \nabla_w f(x, w)\,\nabla_w' f(x, w) - (y - f(x, w))\,\nabla^2_w f(x, w) \approx \nabla_w f(x, w)\,\nabla_w' f(x, w) \tag{39} $$

Approximation (39), known as the Gauss Newton approximation, neglects
the impact of the non linear function $f$ on the curvature of the cost function.
With this approximation, the Hessian of the empirical stochastic cost takes a
very simple form.

$$ H_t(w) \approx \sum_{i=1}^{t} \nabla_w f(x_i, w)\,\nabla_w' f(x_i, w) \tag{40} $$

i=l 

Although the real Hessian can have negative eigenvalues, this approximated
Hessian is always positive semi-definite, a useful property for convergence. Its
expression (40) is reminiscent of the linear case (36). Its inverse can be
computed using similar recursive equations.
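
A scalar example shows the difference between the exact curvature and its Gauss Newton approximation. The sketch below is illustrative: the model $f(x, w) = \exp(wx)$ with a single parameter and the function names are assumptions chosen here because the derivatives are easy to write down.

```python
import math


def model(x, w):
    """Toy non linear model f(x, w) = exp(w * x) with a scalar parameter."""
    return math.exp(w * x)


def d_model(x, w):
    """First derivative of f with respect to w."""
    return x * math.exp(w * x)


def d2_model(x, w):
    """Second derivative of f with respect to w."""
    return x * x * math.exp(w * x)


def true_curvature(x, y, w):
    """Exact second derivative in w of (y - f(x, w))^2 / 2."""
    return d_model(x, w) ** 2 - (y - model(x, w)) * d2_model(x, w)


def gauss_newton_curvature(x, w):
    """Gauss Newton approximation: keeps only the squared first derivative."""
    return d_model(x, w) ** 2


print(true_curvature(1.0, 3.0, 0.0), gauss_newton_curvature(1.0, 0.0))
```

At this point the residual is large enough for the neglected term to make the exact curvature negative, while the approximation remains positive.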

Natural Gradient. Information geometry [1] provides an elegant description
of the geometry of the cost function. It is best introduced by casting the learning
problem as a density estimation problem. A multilayer perceptron $f(x, w)$, for
instance, can be regarded as a parametric regression model $y = f(x, w) + \varepsilon$
where $\varepsilon$ represents an additive Gaussian noise. The network function $f(x, w)$
then becomes part of the Gaussian location model:

$$ p(z|w) = C_\sigma \exp\left( -\frac{(y - f(x, w))^2}{2\sigma^2} \right) \tag{41} $$



The optimal parameters are found by minimizing the Kullback-Leibler
divergence between $p(z|w)$ and the ground truth $P(z)$. This is equivalent to the
familiar optimization of the mean square loss (9):

$$ E_z \log \frac{P(z)}{p(z|w)} = \frac{1}{\sigma^2}\,E_z\,Q_{\mathrm{mse}}(z, w) + \text{Constant} \tag{42} $$



The essential idea consists of endowing the space of the parameters $w$ with
a distance that reflects the proximity of the distributions $p(z|w)$ instead of the
proximity of the parameters $w$. Multilayer networks, for instance, can implement
the same function with very different weight vectors. The new distance distorts
the geometry of the parameter space in order to represent the closeness of these
weight vectors.

The infinitesimal distance between distributions $p(z|w)$ and $p(z|w + dw)$ can
be written as follows:

$$ D(w \,\|\, w + dw) \approx dw'\,\mathcal{G}(w)\,dw \tag{43} $$

where $\mathcal{G}(w)$ is the Fisher Information matrix:

$$ \mathcal{G}(w) = \int \left( \nabla_w \log p(z|w) \right) \left( \nabla_w \log p(z|w) \right)' \, p(z|w)\,dz $$

The determinant $|\mathcal{G}(w)|$ of the Fisher information matrix usually is a smooth
function of the parameter $w$. The parameter space is therefore composed of
Riemannian domains where $|\mathcal{G}(w)| \neq 0$ separated by critical sub-spaces where
$|\mathcal{G}(w)| = 0$.

The Natural Gradient algorithm [3] provides a principled way to search a
Riemannian domain. The gradient $\nabla_w C(w)$ defines the steepest descent direction
in the Euclidean space. The steepest descent direction in a Riemannian domain
differs from the Euclidean one. It is defined as the vector $dw$ which maximizes
$C(w) - C(w + dw)$ in the $\delta$-neighborhood:

$$ D(w \,\|\, w + dw) \approx dw'\,\mathcal{G}(w)\,dw \leq \delta \tag{44} $$

A simple derivation then shows that multiplying the gradient by the inverse 
of the Fisher Information matrix yields the steepest Riemannian direction. The 
Natural Gradient algorithm applies the same correction to the stochastic gradi- 
ent descent algorithm (4): 

$$ w_{t+1} = w_t - \gamma_t\,\mathcal{G}^{-1}(w_t)\,\nabla_w Q(z, w_t) \tag{45} $$
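
The "simple derivation" mentioned above can be sketched with a Lagrange multiplier. This is a worked sketch under assumptions stated here: $\mathcal{G}(w)$ denotes the Fisher Information matrix and is taken to be positive definite (inside a Riemannian domain), the cost is linearized to first order in $dw$, and the constraint is active at the optimum.

```latex
\begin{aligned}
&\text{maximize over } dw: \quad C(w) - C(w + dw) \approx -\nabla_w C(w)'\,dw
 \quad \text{subject to} \quad dw'\,\mathcal{G}(w)\,dw \le \delta \\[4pt]
&L(dw, \lambda) = -\nabla_w C(w)'\,dw
 - \lambda\,\bigl(dw'\,\mathcal{G}(w)\,dw - \delta\bigr),
 \qquad \lambda > 0 \\[4pt]
&\nabla_{dw} L = 0 \;\Longrightarrow\;
 dw = -\frac{1}{2\lambda}\,\mathcal{G}^{-1}(w)\,\nabla_w C(w) \\[4pt]
&dw'\,\mathcal{G}(w)\,dw = \delta \;\Longrightarrow\;
 \lambda = \frac{1}{2}
 \sqrt{\frac{\nabla_w C(w)'\,\mathcal{G}^{-1}(w)\,\nabla_w C(w)}{\delta}}
\end{aligned}
```

The optimal step is therefore proportional to $-\mathcal{G}^{-1}(w)\,\nabla_w C(w)$; the multiplier only fixes its length, a role played by the learning rate in the stochastic update.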

The similarity between the update rules (27) and (45) is obvious. This link
becomes clearer when the Fisher Information matrix is written in Hessian form,

$$ \mathcal{G}(w) = \int - \left( \nabla^2_w \log p(z|w) \right) p(z|w)\,dz $$

where $\nabla^2_w$ denotes a second derivative. When the parameter approaches the
optimum, distribution $p(z|w)$ becomes closer to the ground truth $dP(z)$, and the
Fisher Information matrix $\mathcal{G}(w)$ aligns with the Hessian matrix $\nabla^2_w E_z Q(z, w)$.
The natural gradient asymptotically behaves like a second order algorithm.

Remark. The above algorithms are all derived from (27) and suffer from the
same limitation. The number of coefficients in matrix $\Phi_t$ scales like the square
of the number of parameters. Manipulating such large matrices often requires
excessive computer time and memory.

Result (30) holds when $\Phi_t \to H^{-1}(w^*)$. This implies that $\Phi_t$ must be a full
rank approximation of $H^{-1}$. Suppose instead that $\Phi_t$ converges to a more
economical approximation of $H^{-1}$ involving a limited number of coefficients. With
a proper choice of learning rates $\gamma_t$, such an approximate second order
stochastic gradient algorithm keeps the $O(1/t)$ behavior (30) with a worse constant
$K_A > K$. Such a stochastic algorithm will eventually outperform batch
algorithms because $\log\log N$ will eventually become larger than the ratio $K_A/K$.
In practice this can take a very long time.

Approximate second order stochastic algorithms are still desirable because it
might be simply impossible to store a full rank matrix $\Phi_t$, and because
manipulating the approximation of the Hessian might bring computational gains
that compare well with ratio $K_A/K$. The simplest approximation [5] involves a
diagonal approximation of $\Phi_t$. More sophisticated schemes [32, 38] attempt to
approximate the average value of $\Phi_t \nabla_w Q(z, w_t)$ using simpler calculations for
each example.

5 Conclusion 

A broad family of learning algorithms can be formalized as stochastic gradient 
descent algorithms. It includes numerous well known algorithms such as Per- 
ceptrons, Adalines, K-Means, LVQ, and Multi-Layer Networks as well as more 
ambitious learning systems such as Graph Transformer Networks. 

All these algorithms share common convergence properties. In particular, 
stochastic gradient descent simultaneously is a very poor optimization algorithm 
and a very effective learning algorithm. 
Abstract. The goal of statistical learning theory is to study, in a
statistical framework, the properties of learning algorithms. In particular,
most results take the form of so-called error bounds. This tutorial
introduces the techniques that are used to obtain such results.



1 Introduction 

The main goal of statistical learning theory is to provide a framework for study- 
ing the problem of inference, that is of gaining knowledge, making predictions, 
making decisions or constructing models from a set of data. This is studied in a 
statistical framework, that is there are assumptions of statistical nature about 
the underlying phenomena (in the way the data is generated). 

As a motivation for the need of such a theory, let us just quote V. Vapnik: 

(Vapnik, [1]) Nothing is more practical than a good theory. 

Indeed, a theory of inference should be able to give a formal definition of 
words like learning, generalization, overfitting, and also to characterize the per- 
formance of learning algorithms so that, ultimately, it may help design better 
learning algorithms. 

There are thus two goals: make things more precise and derive new or im- 
proved algorithms. 



1.1 Learning and Inference 

What is under study here is the process of inductive inference which can roughly 
be summarized as the following steps: 
1. Observe a phenomenon

2. Construct a model of that phenomenon 

3. Make predictions using this model 

Of course, this definition is very general and could be taken more or less
as the goal of Natural Sciences. The goal of Machine Learning is to actually
automate this process and the goal of Learning Theory is to formalize it.

In this tutorial we consider a special case of the above process which is the 
supervised learning framework for pattern recognition. In this framework, the 
data consists of instance-label pairs, where the label is either +1 or -1. Given a
set of such pairs, a learning algorithm constructs a function mapping instances to 
labels. This function should be such that it makes few mistakes when predicting 
the label of unseen instances. 

Of course, given some training data, it is always possible to build a function
that fits the data exactly. But, in the presence of noise, this may not be the
best thing to do as it would lead to a poor performance on unseen instances
(this is usually referred to as overfitting). The general idea behind the design of
learning algorithms is thus to look for regularities (in a sense to be defined
later) in the observed phenomenon (i.e. training data). These can then be
generalized from the observed past to the future. Typically, one would look, in a
collection of possible models, for one which fits the data well, but at the same
time is as simple as possible (see Figure 1). This immediately raises the question
of how to measure and quantify simplicity of a model (i.e. a {-1, +1}-valued
function).




Fig. 1. Trade-off between fit and complexity 



It turns out that there are many ways to do so, but no best one. For example
in Physics, people tend to prefer models which have a small number of constants
and that correspond to simple mathematical formulas. Often, the length of
description of a model in a coding language can be an indication of its complexity.
In classical statistics, the number of free parameters of a model is usually a
measure of its complexity. Surprising as it may seem, there is no universal way
of measuring simplicity (or its counterpart complexity) and the choice of a
specific measure inherently depends on the problem at hand. It is actually in this
choice that the designer of the learning algorithm introduces knowledge about
the specific phenomenon under study.

This lack of a universally best choice can actually be formalized in what is
called the No Free Lunch theorem, which in essence says that, if there is no
assumption on how the past (i.e. training data) is related to the future (i.e. test
data), prediction is impossible. Even more, if there is no a priori restriction on
the possible phenomena that are expected, it is impossible to generalize and
there is thus no better algorithm (any algorithm would be beaten by another
one on some phenomenon).

Hence the need to make assumptions, like the fact that the phenomenon we 
observe can be explained by a simple model. However, as we said, simplicity is 
not an absolute notion, and this leads to the statement that data cannot replace 
knowledge, or in pseudo-mathematical terms: 

Generalization = Data + Knowledge

1.2 Assumptions 

We now make more precise the assumptions that are made by the Statistical 
Learning Theory framework. Indeed, as we said before we need to assume that 
the future (i.e. test) observations are related to the past (i.e. training) ones, so 
that the phenomenon is somewhat stationary. 

At the core of the theory is a probabilistic model of the phenomenon (or data 
generation process). Within this model, the relationship between past and future 
observations is that they both are sampled independently from the same distri- 
bution (i.i.d.). The independence assumption means that each new observation 
yields maximum information. The identical distribution means that the obser- 
vations give information about the underlying phenomenon (here a probability 
distribution) . 

An immediate consequence of this very general setting is that one can
construct algorithms (e.g. k-nearest neighbors with appropriate k) that are
consistent, which means that, as one gets more and more data, the predictions of
the algorithm are closer and closer to the optimal ones. So this seems to indicate
that we can have some sort of universal algorithm. Unfortunately, any (consistent)
algorithm can have an arbitrarily bad behavior when given a finite training set.
These notions are formalized in Appendix B.

Again, this discussion indicates that generalization can only come when one 
adds specific knowledge to the data. Each learning algorithm encodes specific 
knowledge (or a specific assumption about what the optimal classifier looks like),
and works best when this assumption is satisfied by the problem to which it is
applied.

Bibliographical Remarks. Several textbooks, surveys, and research monographs
have been written on pattern classification and statistical learning theory.
A partial list includes Anthony and Bartlett [2], Breiman, Friedman, Olshen,
and Stone [3], Devroye, Gyorfi, and Lugosi [4], Duda and Hart [5], Fukunaga [6],
Kearns and Vazirani [7], Kulkarni, Lugosi, and Venkatesh [8], Lugosi [9],
McLachlan [10], Mendelson [11], Natarajan [12], Vapnik [13, 14, 1], and Vapnik and
Chervonenkis [15].

2 Formalization 

We consider an input space $\mathcal{X}$ and output space $\mathcal{Y}$. Since we
restrict ourselves to binary classification, we choose $\mathcal{Y} = \{-1, 1\}$.
Formally, we assume that the pairs $(X, Y) \in \mathcal{X} \times \mathcal{Y}$ are
random variables distributed according to an unknown distribution $P$. We observe
a sequence of $n$ i.i.d. pairs $(X_i, Y_i)$ sampled according to $P$ and the goal
is to construct a function $g : \mathcal{X} \to \mathcal{Y}$ which predicts $Y$
from $X$.

We need a criterion to choose this function $g$. This criterion is a low
probability of error $P(g(X) \neq Y)$. We thus define the risk of $g$ as

$$R(g) = P(g(X) \neq Y) = E[1_{g(X) \neq Y}] .$$

Notice that $P$ can be decomposed as $P_X \times P(Y|X)$. We introduce the
regression function $\eta(x) = E[Y | X = x] = 2P[Y = 1 | X = x] - 1$ and the
target function (or Bayes classifier) $t(x) = \mathrm{sgn}\,\eta(x)$. This function
achieves the minimum risk over all possible measurable functions:

$$R(t) = \inf_{g} R(g) .$$

We will denote the value $R(t)$ by $R^*$, called the Bayes risk. In the
deterministic case, one has $Y = t(X)$ almost surely ($P[Y = 1 | X] \in \{0, 1\}$)
and $R^* = 0$. In the general case we can define the noise level as
$s(x) = \min(P[Y = 1 | X = x],\, 1 - P[Y = 1 | X = x]) = (1 - |\eta(x)|)/2$
($s(X) = 0$ almost surely in the deterministic case) and this gives
$R^* = E\,s(X)$.

Our goal is thus to identify this function $t$, but since $P$ is unknown we cannot
directly measure the risk and we also cannot know directly the value of $t$ at the
data points. We can only measure the agreement of a candidate function with
the data. This is called the empirical risk:

$$R_n(g) = \frac{1}{n} \sum_{i=1}^{n} 1_{g(X_i) \neq Y_i} .$$

It is common to use this quantity as a criterion to select an estimate of t. 

2.1 Algorithms 

Now that the goal is clearly specified, we review the common strategies to
(approximately) achieve it. We denote by $g_n$ the function returned by the
algorithm.

Because one cannot compute $R(g)$ but only approximate it by $R_n(g)$, it would
be unreasonable to look for the function minimizing $R_n(g)$ among all possible
functions. Indeed, when the input space is infinite, one can always construct a
function $g_n$ which perfectly predicts the labels of the training data (i.e.
$g_n(X_i) = Y_i$, and $R_n(g_n) = 0$), but behaves on the other points as the
opposite of the target function $t$, i.e. $g_n(X) = -Y$ so that $R(g_n) = 1$
(see footnote 1). So one would have minimum empirical risk but maximum risk.

It is thus necessary to prevent this overfitting situation. There are essentially 
two ways to do this (which can be combined). The first one is to restrict the 
class of functions in which the minimization is performed, and the second is to 
modify the criterion to be minimized (e.g. adding a penalty for ‘complicated’ 
functions) . 

Empirical Risk Minimization. This algorithm is one of the most straightforward,
yet it is usually efficient. The idea is to choose a model $\mathcal{G}$ of
possible functions and to minimize the empirical risk in that model:

$$g_n = \arg\min_{g \in \mathcal{G}} R_n(g) .$$

Of course, this will work best when the target function belongs to $\mathcal{G}$.
However, it is rare to be able to make such an assumption, so one may want to
enlarge the model as much as possible, while preventing overfitting.
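As a minimal sketch of empirical risk minimization, assume a small finite model made of threshold classifiers on the real line. The model, the toy sample, and the names `empirical_risk`, `make_threshold` and `erm` are illustrative assumptions, not part of the text.

```python
def empirical_risk(g, sample):
    """Fraction of pairs (x, y) in the sample that g misclassifies."""
    return sum(1 for x, y in sample if g(x) != y) / len(sample)


def make_threshold(theta):
    """Hypothetical model element: x -> +1 if x >= theta, else -1."""
    return lambda x: 1 if x >= theta else -1


def erm(thetas, sample):
    """Return the threshold whose classifier has the smallest empirical risk."""
    return min(thetas, key=lambda th: empirical_risk(make_threshold(th), sample))


# Toy sample: labels follow the threshold 0.5 except for one noisy point (0.7, -1).
sample = [(0.1, -1), (0.2, -1), (0.4, -1), (0.55, 1),
          (0.6, 1), (0.7, -1), (0.8, 1), (0.9, 1)]
thetas = [0.0, 0.25, 0.5, 0.75, 1.0]

best = erm(thetas, sample)
print(best, empirical_risk(make_threshold(best), sample))
```

With only five candidate thresholds the minimizer cannot fit the noisy point, which is the sense in which restricting the model limits overfitting.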

Structural Risk Minimization. The idea here is to choose an infinite sequence
$\{\mathcal{G}_d : d = 1, 2, \ldots\}$ of models of increasing size and to minimize
the empirical risk in each model with an added penalty for the size of the model:

$$g_n = \arg\min_{g \in \mathcal{G}_d,\ d \in \mathbb{N}} R_n(g) + \mathrm{pen}(d, n) .$$

The penalty $\mathrm{pen}(d, n)$ gives preference to models where estimation error
is small and measures the size or capacity of the model.

Regularization. Another approach, usually easier to implement, consists in
choosing a large model $\mathcal{G}$ (possibly dense in the continuous functions
for example) and defining on $\mathcal{G}$ a regularizer, typically a norm
$\|g\|$. Then one has to minimize the regularized empirical risk:

$$g_n = \arg\min_{g \in \mathcal{G}} R_n(g) + \lambda \|g\|^2 .$$

Compared to SRM, there is here a free parameter $\lambda$, called the
regularization parameter, which allows to choose the right trade-off between fit
and complexity.



^1 Strictly speaking this is only possible if the probability distribution
satisfies some mild conditions (e.g. has no atoms). Otherwise, it may not be
possible to achieve $R(g_n) = 1$ but even in this case, provided the support of
$P$ contains infinitely many points, a similar phenomenon occurs.

Tuning $\lambda$ is usually a hard problem and most often, one uses extra
validation data for this task.

Most existing (and successful) methods can be thought of as regularization 
methods. 



Normalized Regularization. There are other possible approaches when the
regularizer can, in some sense, be 'normalized', i.e. when it corresponds to some
probability distribution over $\mathcal{G}$.

Given a probability distribution $\pi$ defined on $\mathcal{G}$ (usually called a
prior), one can use as a regularizer $-\log \pi(g)$ (see footnote 2).
Reciprocally, from a regularizer of the form $\|g\|^2$, if there exists a measure
$\mu$ on $\mathcal{G}$ such that $\int e^{-\lambda \|g\|^2} d\mu(g) < \infty$ for
some $\lambda > 0$, then one can construct a prior corresponding to this
regularizer. For example, if $\mathcal{G}$ is the set of hyperplanes in
$\mathbb{R}^d$ going through the origin, $\mathcal{G}$ can be identified with
$\mathbb{R}^d$ and, taking $\mu$ as the Lebesgue measure, it is possible to go
from the Euclidean norm regularizer to a spherical Gaussian measure on
$\mathbb{R}^d$ as a prior (see footnote 3).

This type of normalized regularizer, or prior, can be used to construct another
probability distribution $\rho$ on $\mathcal{G}$ (usually called posterior), as

$$\rho(g) = \frac{e^{-\gamma R_n(g)}}{Z(\gamma)}\, \pi(g) ,$$

where $\gamma > 0$ is a free parameter and $Z(\gamma)$ is a normalization factor.

There are several ways in which this $\rho$ can be used. If we take the function
maximizing it, we recover regularization as

$$\arg\max_{g \in \mathcal{G}} \rho(g) = \arg\min_{g \in \mathcal{G}} \gamma R_n(g) - \log \pi(g) ,$$

where the regularizer is $-\gamma^{-1} \log \pi(g)$ (see footnote 4).

Also, $\rho$ can be used to randomize the predictions. In that case, before
computing the predicted label for an input $x$, one samples a function $g$
according to $\rho$ and outputs $g(x)$. This procedure is usually called Gibbs
classification.

Another way in which the distribution $\rho$ constructed above can be used is by
taking the expected prediction of the functions in $\mathcal{G}$:

$$g_n(x) = \mathrm{sgn}(E_\rho(g(x))) .$$

This is typically called Bayesian averaging.
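The three uses of the posterior described above can be sketched for a finite class. This is only an illustration: the three threshold classifiers, the uniform prior, the empirical risks, and the value of gamma are made-up assumptions, and the convention sgn(0) = +1 is a choice made here.

```python
import math
import random


def gibbs_posterior(prior, emp_risks, gamma):
    """rho(g) proportional to prior(g) * exp(-gamma * R_n(g)) on a finite class."""
    weights = [p * math.exp(-gamma * r) for p, r in zip(prior, emp_risks)]
    z = sum(weights)  # normalization factor Z(gamma)
    return [w / z for w in weights]


def map_index(rho):
    """Index of the posterior maximizer (this recovers regularized ERM)."""
    return max(range(len(rho)), key=lambda i: rho[i])


def gibbs_predict(classifiers, rho, x, rng):
    """Gibbs classification: sample g according to rho, then output g(x)."""
    g = rng.choices(classifiers, weights=rho, k=1)[0]
    return g(x)


def bayes_average_predict(classifiers, rho, x):
    """Bayesian averaging: sign of the rho-weighted mean prediction (sgn(0) = +1)."""
    mean = sum(w * g(x) for g, w in zip(classifiers, rho))
    return 1 if mean >= 0 else -1


# Hypothetical finite class: thresholds at 0.25, 0.5 and 0.75.
classifiers = [lambda x, th=th: 1 if x >= th else -1 for th in (0.25, 0.5, 0.75)]
prior = [1 / 3, 1 / 3, 1 / 3]
emp_risks = [0.25, 0.125, 0.25]  # assumed empirical risks R_n(g)
rho = gibbs_posterior(prior, emp_risks, gamma=8.0)

print(map_index(rho))
print(bayes_average_predict(classifiers, rho, 0.6))
print(gibbs_predict(classifiers, rho, 0.6, random.Random(0)))
```

Larger values of gamma concentrate the posterior on the empirical risk minimizer, while gamma close to zero leaves the prior almost unchanged.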



^2 This is fine when $\mathcal{G}$ is countable. In the continuous case, one has
to consider the density associated to $\pi$. We omit these details.

^3 Generalization to infinite dimensional Hilbert spaces can also be done but it
requires more care. One can for example establish a correspondence between the
norm of a reproducing kernel Hilbert space and a Gaussian process prior whose
covariance function is the kernel of this space.

^4 Note that minimizing $\gamma R_n(g) - \log \pi(g)$ is equivalent to minimizing
$R_n(g) - \gamma^{-1} \log \pi(g)$.

At this point we have to insist again on the fact that the choice of the class
$\mathcal{G}$ and of the associated regularizer or prior has to come from a priori
knowledge about the task at hand, and there is no universally best choice.

2.2 Bounds 

We have presented the framework of the theory and the type of algorithms that
it studies; we now introduce the kind of results that it aims at. The overall goal
is to characterize the risk that some algorithm may have in a given situation. More
precisely, a learning algorithm takes as input the data
$(X_1, Y_1), \ldots, (X_n, Y_n)$ and produces a function $g_n$ which depends on
this data. We want to estimate the risk of $g_n$. However, $R(g_n)$ is a random
variable (since it depends on the data) and it cannot be computed from the data
(since it also depends on the unknown $P$). Estimates of $R(g_n)$ thus usually
take the form of probabilistic bounds.

Notice that when the algorithm chooses its output from a model $\mathcal{G}$, it
is possible, by introducing the best function $g^*$ in $\mathcal{G}$, with
$R(g^*) = \inf_{g \in \mathcal{G}} R(g)$, to write

$$R(g_n) - R^* = [R(g^*) - R^*] + [R(g_n) - R(g^*)] .$$

The first term on the right hand side is usually called the approximation
error, and measures how well functions in $\mathcal{G}$ can approach the target
(it would be zero if $t \in \mathcal{G}$). The second term, called estimation
error, is a random quantity (it depends on the data) and measures how close $g_n$
is to the best possible choice in $\mathcal{G}$.

Estimating the approximation error is usually hard since it requires knowl- 
edge about the target. Classically, in Statistical Learning Theory it is preferable 
to avoid making specific assumptions about the target (such as its belonging to 
some model), but the assumptions are rather on the value of R* , or on the noise 
function s. 

It is also known that for any (consistent) algorithm, the rate of convergence
to zero of the approximation error (see footnote 5) can be arbitrarily slow if one
does not make assumptions about the regularity of the target, while the rate of
convergence of the estimation error can be computed without any such assumption.
We will thus focus on the estimation error.

Another possible decomposition of the risk is the following:

$$R(g_n) = R_n(g_n) + [R(g_n) - R_n(g_n)] .$$

In this case, one estimates the risk by its empirical counterpart, and some
quantity which approximates (or upper bounds) $R(g_n) - R_n(g_n)$.

To summarize, we write the three types of results we may be interested in.

— Error bound: $R(g_n) \le R_n(g_n) + B(n, \mathcal{G})$. This corresponds to the
estimation of the risk from an empirical quantity.

— Error bound relative to the best in the class:
$R(g_n) \le R(g^*) + B(n, \mathcal{G})$. This tells how "optimal" the algorithm is
given the model it uses.

— Error bound relative to the Bayes risk: $R(g_n) \le R^* + B(n, \mathcal{G})$.
This gives theoretical guarantees on the convergence to the Bayes risk.

^5 For this convergence to mean anything, one has to consider algorithms which
choose functions from a class which grows with the sample size. This is the case
for example of Structural Risk Minimization or Regularization based algorithms.

3 Basic Bounds 

In this section we show how to obtain simple error bounds (also called general- 
ization bounds). The elementary material from probability theory that is needed 
here and in the later sections is summarized in Appendix A. 

3.1 Relationship to Empirical Processes 

Recall that we want to estimate the risk $R(g_n) = E[1_{g_n(X) \neq Y}]$ of the
function $g_n$ returned by the algorithm after seeing the data
$(X_1, Y_1), \ldots, (X_n, Y_n)$. This quantity cannot be observed ($P$ is unknown)
and is a random variable (since it depends on the data). Hence one way to make a
statement about this quantity is to say how it relates to an estimate such as the
empirical risk $R_n(g_n)$. This relationship can take the form of upper and lower
bounds for

$$P[R(g_n) - R_n(g_n) > \varepsilon] .$$

For convenience, let $Z_i = (X_i, Y_i)$ and $Z = (X, Y)$. Given $\mathcal{G}$
define the loss class

$$\mathcal{F} = \{f : (x, y) \mapsto 1_{g(x) \neq y} : g \in \mathcal{G}\} . \qquad (1)$$

Notice that $\mathcal{G}$ contains functions with range in $\{-1, 1\}$ while
$\mathcal{F}$ contains non-negative functions with range in $\{0, 1\}$. In the
remainder of the tutorial, we will go back and forth between $\mathcal{F}$ and
$\mathcal{G}$ (as there is a bijection between them), sometimes stating the results
in terms of functions in $\mathcal{F}$ and sometimes in terms of functions in
$\mathcal{G}$. It will be clear from the context which classes $\mathcal{G}$ and
$\mathcal{F}$ we refer to, and $\mathcal{F}$ will always be derived from the last
mentioned class $\mathcal{G}$ in the way of (1).

We use the shorthand notation $Pf = E[f(X, Y)]$ and
$P_n f = \frac{1}{n} \sum_{i=1}^{n} f(X_i, Y_i)$. $P_n$ is usually called the
empirical measure associated to the training sample. With this notation, the
quantity of interest (difference between true and empirical risks) can be written
as

$$P f_n - P_n f_n . \qquad (2)$$

An empirical process is a collection of random variables indexed by a class
of functions, and such that each random variable is distributed as a sum of i.i.d.
random variables (values taken by the function at the data):

$$\{Pf - P_n f\}_{f \in \mathcal{F}} .$$

One of the most studied quantities associated to empirical processes is their
supremum:

$$\sup_{f \in \mathcal{F}} Pf - P_n f .$$

It is clear that if we know an upper bound on this quantity, it will be an
upper bound on (2). This shows that the theory of empirical processes is a great
source of tools and techniques for Statistical Learning Theory.



3.2 Hoeffding’s Inequality 

Let us rewrite again the quantity we are interested in as follows

$$R(g) - R_n(g) = E[f(Z)] - \frac{1}{n} \sum_{i=1}^{n} f(Z_i) .$$

It is easy to recognize here the difference between the expectation and the
empirical average of the random variable $f(Z)$. By the law of large numbers, we
immediately obtain that

$$P\left[\lim_{n \to \infty} \frac{1}{n} \sum_{i=1}^{n} f(Z_i) - E[f(Z)] = 0\right] = 1 .$$

This indicates that with enough samples, the empirical risk of a function is
a good approximation to its true risk.

It turns out that there exists a quantitative version of the law of large numbers
when the variables are bounded.



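Theorem 1 (Hoeffding). Let $Z_1, \ldots, Z_n$ be $n$ i.i.d. random variables with
$f(Z) \in [a, b]$. Then for all $\varepsilon > 0$, we have

$$P\left[\left|\frac{1}{n} \sum_{i=1}^{n} f(Z_i) - E[f(Z)]\right| > \varepsilon\right] \le 2 \exp\left(-\frac{2 n \varepsilon^2}{(b - a)^2}\right) .$$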



Let us rewrite the above formula to better understand its consequences. Denote
the right hand side by $\delta$. Then

$$P\left[|P_n f - Pf| > (b - a) \sqrt{\frac{\log\frac{2}{\delta}}{2n}}\right] \le \delta ,$$

or (by inversion, see Appendix A) with probability at least $1 - \delta$,

$$|P_n f - Pf| \le (b - a) \sqrt{\frac{\log\frac{2}{\delta}}{2n}} .$$

Applying this to $f(Z) = 1_{g(X) \neq Y}$ we get that for any $g$, and any
$\delta > 0$, with probability at least $1 - \delta$

$$R(g) \le R_n(g) + \sqrt{\frac{\log\frac{2}{\delta}}{2n}} . \qquad (3)$$

Notice that one has to consider a fixed function $g$ and the probability is with
respect to the sampling of the data. If the function depends on the data this
does not apply!
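To get a feeling for the size of this deviation term, the following sketch evaluates it and checks by simulation how often a fixed function stays within it. The simulation setup (a fixed function whose losses are i.i.d. Bernoulli draws with an assumed true risk of 0.3) and the function names are illustrative assumptions.

```python
import math
import random


def hoeffding_radius(n, delta, a=0.0, b=1.0):
    """Deviation (b - a) * sqrt(log(2 / delta) / (2 n)) from Hoeffding's inequality."""
    return (b - a) * math.sqrt(math.log(2.0 / delta) / (2.0 * n))


def coverage(p_error, n, delta, trials, rng):
    """Fraction of simulated samples with |R_n(g) - R(g)| within the radius.

    A fixed function g with true risk p_error is simulated by drawing
    n i.i.d. Bernoulli(p_error) losses in each trial.
    """
    radius = hoeffding_radius(n, delta)
    hits = 0
    for _ in range(trials):
        emp = sum(1 for _ in range(n) if rng.random() < p_error) / n
        if abs(emp - p_error) <= radius:
            hits += 1
    return hits / trials


rng = random.Random(0)
print(hoeffding_radius(1000, 0.05))
print(coverage(0.3, 200, 0.05, 2000, rng))
```

The observed coverage is typically well above 1 - delta, since Hoeffding's inequality uses only the boundedness of the losses.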
3.3 Limitations

Although the above result seems very nice (since it applies to any class of
bounded functions), it is actually severely limited. Indeed, what it essentially
says is that for each (fixed) function $f \in \mathcal{F}$, there is a set $S$ of
samples for which $Pf - P_n f \le \sqrt{\log(2/\delta)/(2n)}$ (and this set of
samples has measure $P[S] \ge 1 - \delta$). However, these sets $S$ may be
different for different functions. In other words, for the observed sample, only
some of the functions in $\mathcal{F}$ will satisfy this inequality.

Another way to explain the limitation of Hoeffding's inequality is the following.
If we take for $\mathcal{G}$ the class of all $\{-1, 1\}$-valued (measurable)
functions, then for any fixed sample, there exists a function
$f \in \mathcal{F}$ such that

$$Pf - P_n f = 1 .$$

To see this, take the function which is $f(X_i) = Y_i$ on the data and
$f(X) = -Y$ everywhere else. This does not contradict Hoeffding's inequality but
shows that it does not yield what we need.
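The construction above can be reproduced numerically. In this sketch the deterministic target `target`, the uniform inputs, and the sample sizes are assumptions chosen only to make the effect visible; the memorizing function agrees with the labels on the training points and with the opposite of the target elsewhere.

```python
import random


def target(x):
    """Hypothetical deterministic target t(x): +1 on the right half of [0, 1]."""
    return 1 if x >= 0.5 else -1


def make_memorizer(sample):
    """Function equal to the labels on the training points and to -t elsewhere."""
    table = dict(sample)
    return lambda x: table[x] if x in table else -target(x)


def error_rate(f, pairs):
    """Fraction of pairs (x, y) with f(x) != y."""
    return sum(1 for x, y in pairs if f(x) != y) / len(pairs)


rng = random.Random(0)
train = [(x, target(x)) for x in (rng.random() for _ in range(50))]
test = [(x, target(x)) for x in (rng.random() for _ in range(1000))]

f = make_memorizer(train)
print(error_rate(f, train), error_rate(f, test))
```

The empirical risk on the training sample is zero, while the error on fresh points is essentially one because new inputs almost never coincide with training inputs.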




Fig. 2. Convergence of the empirical risk to the true risk over the class of functions 



Figure 2 illustrates the above argument. The horizontal axis corresponds
to the functions in the class. The two curves represent the true risk and the
empirical risk (for some training sample) of these functions. The true risk is
fixed, while for each different sample, the empirical risk will be a different
curve. If we observe a fixed function $g$ and take several different samples, the
point on the empirical curve will fluctuate around the true risk with fluctuations
controlled by Hoeffding's inequality. However, for a fixed sample, if the class
$\mathcal{G}$ is big enough, one can find somewhere along the axis, a function for
which the difference between the two curves will be very large.

3.4 Uniform Deviations

Before seeing the data, we do not know which function the algorithm will choose.
The idea is to consider uniform deviations

$$R(f_n) - R_n(f_n) \le \sup_{f \in \mathcal{F}} (R(f) - R_n(f)) . \qquad (4)$$

In other words, if we can upper bound the supremum on the right, we are
done. For this, we need a bound which holds simultaneously for all functions in
a class.

Let us explain how one can construct such uniform bounds. Consider two
functions $f_1, f_2$ and define

$$C_i = \{(x_1, y_1), \ldots, (x_n, y_n) : P f_i - P_n f_i > \varepsilon\} .$$

This set contains all the 'bad' samples, i.e. those for which the bound fails.
From Hoeffding's inequality, for each $i$

$$P[C_i] \le \delta .$$

We want to measure how many samples are 'bad' for $i = 1$ or $i = 2$. For this
we use the union bound (see Appendix A)

$$P[C_1 \cup C_2] \le P[C_1] + P[C_2] \le 2\delta .$$

More generally, if we have $N$ functions in our class, we can write

$$P[C_1 \cup \ldots \cup C_N] \le \sum_{i=1}^{N} P[C_i] .$$

As a result we obtain

$$P[\exists f \in \{f_1, \ldots, f_N\} : Pf - P_n f > \varepsilon]
  \le \sum_{i=1}^{N} P[P f_i - P_n f_i > \varepsilon]
  \le N \exp(-2 n \varepsilon^2) .$$

Hence, for $\mathcal{G} = \{g_1, \ldots, g_N\}$, for all $\delta > 0$ with
probability at least $1 - \delta$,

$$\forall g \in \mathcal{G}, \quad R(g) \le R_n(g) + \sqrt{\frac{\log N + \log\frac{1}{\delta}}{2n}} .$$

This is an error bound. Indeed, if we know that our algorithm picks functions
from $\mathcal{G}$, we can apply this result to $g_n$ itself.

Notice that the main difference with Hoeffding's inequality is the extra $\log N$
term on the right hand side. This is the term which accounts for the fact that we
want $N$ bounds to hold simultaneously. Another interpretation of this term is as
the number of bits one would require to specify one function in $\mathcal{G}$. It
turns out that this kind of coding interpretation of generalization bounds is often
possible and can be used to obtain error estimates [16].
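The finite-class bound is easy to evaluate. The sketch below computes the deviation term and, by inverting it, the sample size needed to reach a given accuracy; the function names and the numerical values are illustrative assumptions.

```python
import math


def finite_class_radius(num_functions, n, delta):
    """sqrt((log N + log(1 / delta)) / (2 n)): uniform deviation over N functions."""
    return math.sqrt((math.log(num_functions) + math.log(1.0 / delta)) / (2.0 * n))


def sample_size_for(num_functions, epsilon, delta):
    """Smallest n for which the radius above is at most epsilon."""
    numerator = math.log(num_functions) + math.log(1.0 / delta)
    return math.ceil(numerator / (2.0 * epsilon ** 2))


n_needed = sample_size_for(1000, 0.05, 0.05)
print(n_needed, finite_class_radius(1000, n_needed, 0.05))
```

Because the class size enters only through log N, squaring the number of functions merely doubles that part of the required sample size.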
3.5 Estimation Error

Using the same idea as before, and with no additional effort, we can also get a
bound on the estimation error. We start from the inequality

$$R(g^*) \le R_n(g^*) + \sup_{g \in \mathcal{G}} (R(g) - R_n(g)) ,$$

which we combine with (4) and with the fact that since $g_n$ minimizes the
empirical risk in $\mathcal{G}$,

$$R_n(g^*) - R_n(g_n) \ge 0 .$$

Thus we obtain

$$R(g_n) = R(g_n) - R(g^*) + R(g^*)$$
$$\le R_n(g^*) - R_n(g_n) + R(g_n) - R(g^*) + R(g^*)$$
$$\le 2 \sup_{g \in \mathcal{G}} |R(g) - R_n(g)| + R(g^*) .$$

We obtain that with probability at least $1 - \delta$

$$R(g_n) \le R(g^*) + 2 \sqrt{\frac{\log N + \log\frac{2}{\delta}}{2n}} .$$

We notice that in the right hand side, both terms depend on the size of the
class $\mathcal{G}$. If this size increases, the first term will decrease, while
the second will increase.



3.6 Summary and Perspective 

At this point, we can summarize what we have presented so far.

— Inference requires putting assumptions on the process generating the data
(data sampled i.i.d. from an unknown $P$); generalization requires knowledge
(e.g. restriction, structure, or prior).

— The error bounds are valid with respect to the repeated sampling of training
sets.

— For a fixed function $g$, for most of the samples

$$R(g) - R_n(g) \approx 1/\sqrt{n} .$$

— For most of the samples, if $|\mathcal{G}| = N$,

$$\sup_{g \in \mathcal{G}} R(g) - R_n(g) \approx \sqrt{\log N / n} .$$

The extra variability comes from the fact that the chosen $g_n$ changes with
the data.

So the result we have obtained so far is that with high probability, for a finite
class of size $N$,

$$\sup_{g \in \mathcal{G}} (R(g) - R_n(g)) \le \sqrt{\frac{\log N + \log\frac{1}{\delta}}{2n}} .$$

There are several things that can be improved: 

— Hoeffding's inequality only uses the boundedness of the functions, not their
variance.

— The union bound is as bad as if all the functions in the class were independent
(i.e. if $f_1(Z)$ and $f_2(Z)$ were independent).

— The supremum over $\mathcal{G}$ of $R(g) - R_n(g)$ is not necessarily what the
algorithm would choose, so that upper bounding $R(g_n) - R_n(g_n)$ by the supremum
might be loose.



4 Infinite Case: Vapnik-Chervonenkis Theory 

In this section we show how to extend the previous results to the case where the
class $\mathcal{G}$ is infinite. This requires, in the non-countable case, the
introduction of tools from Vapnik-Chervonenkis Theory.

4.1 Refined Union Bound and Countable Case 

We first start with a simple refinement of the union bound that allows to extend
the previous results to the (countably) infinite case.

Recall that by Hoeffding's inequality, for each $f \in \mathcal{F}$, for each
$\delta > 0$ (possibly depending on $f$, which we write $\delta(f)$),

$$P\left[Pf - P_n f > \sqrt{\frac{\log\frac{1}{\delta(f)}}{2n}}\right] \le \delta(f) .$$

Hence, if we have a countable set $\mathcal{F}$, the union bound immediately yields

$$P\left[\exists f \in \mathcal{F} : Pf - P_n f > \sqrt{\frac{\log\frac{1}{\delta(f)}}{2n}}\right] \le \sum_{f \in \mathcal{F}} \delta(f) .$$

Choosing $\delta(f) = \delta\, p(f)$ with $\sum_{f \in \mathcal{F}} p(f) = 1$, this
makes the right-hand side equal to $\delta$ and we get the following result. With
probability at least $1 - \delta$,

$$\forall f \in \mathcal{F}, \quad Pf \le P_n f + \sqrt{\frac{\log\frac{1}{p(f)} + \log\frac{1}{\delta}}{2n}} .$$

We notice that if $\mathcal{F}$ is finite (with size $N$), taking a uniform $p$
gives the $\log N$ as before.

Using this approach, it is possible to put knowledge about the algorithm into
$p(f)$, but $p$ should be chosen before seeing the data, so it is not possible to
'cheat' by setting all the weight to the function returned by the algorithm after
seeing the data (which would give the smallest possible bound). But, in general,
if $p$ is well-chosen, the bound will have a small value. Hence, the bound can be
improved if one knows ahead of time the functions that the algorithm is likely
to pick (i.e. knowledge improves the bound).
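The effect of the weights p(f) can be illustrated numerically. In the sketch below the prior p(f_k) = 1/(k(k+1)) on a countable class indexed by k = 1, 2, ... is a hypothetical choice (it sums to one), and `weighted_radius` is an illustrative name for the deviation term of the refined bound.

```python
import math


def weighted_radius(p_f, n, delta):
    """sqrt((log(1 / p(f)) + log(1 / delta)) / (2 n)) for a function of weight p(f)."""
    return math.sqrt((math.log(1.0 / p_f) + math.log(1.0 / delta)) / (2.0 * n))


def prior(k):
    """Hypothetical prior on f_1, f_2, ...: p(f_k) = 1 / (k (k + 1))."""
    return 1.0 / (k * (k + 1))


n, delta = 1000, 0.05
for k in (1, 10, 100, 1000):
    print(k, weighted_radius(prior(k), n, delta))
```

Functions given a large prior weight receive a tighter bound, while functions far down the list pay a price that grows only logarithmically in 1/p(f).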

4.2 General Case 

When the set $\mathcal{G}$ is uncountable, the previous approach does not directly
work. The general idea is to look at the function class 'projected' on the sample.
More precisely, given a sample $z_1, \ldots, z_n$, we consider

$$\mathcal{F}_{z_1, \ldots, z_n} = \{(f(z_1), \ldots, f(z_n)) : f \in \mathcal{F}\} .$$

The size of this set is the number of possible ways in which the data
$(z_1, \ldots, z_n)$ can be classified. Since the functions $f$ can only take two
values, this set will always be finite, no matter how big $\mathcal{F}$ is.

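Definition 1 (Growth Function). The growth function is the maximum number of
ways into which $n$ points can be classified by the function class:

$$S_{\mathcal{F}}(n) = \sup_{(z_1, \ldots, z_n)} |\mathcal{F}_{z_1, \ldots, z_n}| .$$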



We have defined the growth function in terms of the loss class $\mathcal{F}$ but
we can do the same with the initial class $\mathcal{G}$ and notice that
$S_{\mathcal{F}}(n) = S_{\mathcal{G}}(n)$.

It turns out that this growth function can be used as a measure of the 'size'
of a class of functions as demonstrated by the following result.



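Theorem 2 (Vapnik-Chervonenkis). For any $\delta > 0$, with probability at least
$1 - \delta$,

$$\forall g \in \mathcal{G}, \quad R(g) \le R_n(g) + 2 \sqrt{2\, \frac{\log S_{\mathcal{G}}(2n) + \log\frac{4}{\delta}}{n}} .$$

Notice that, in the finite case where $|\mathcal{G}| = N$, we have
$S_{\mathcal{G}}(n) \le N$ so that this bound is always better than the one we had
before (except for the constants). But the problem becomes now one of computing
$S_{\mathcal{G}}(n)$.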



4.3 VC Dimension

Since $g \in \{-1, 1\}$, it is clear that $S_{\mathcal{G}}(n) \le 2^n$. If
$S_{\mathcal{G}}(n) = 2^n$, there is a set of size $n$ such that the class of
functions can generate any classification on these points (we say that
$\mathcal{G}$ shatters the set).

Definition 2 (VC Dimension). The VC dimension of a class $\mathcal{G}$ is the
largest $n$ such that

$$S_{\mathcal{G}}(n) = 2^n .$$

In other words, the VC dimension of a class $\mathcal{G}$ is the size of the
largest set that it can shatter.

In order to illustrate this definition, we give some examples. The first one is
the set of half-spaces in $\mathbb{R}^d$ (see Figure 3). In this case, as depicted
for the case $d = 2$, one can shatter a set of $d + 1$ points but no set of
$d + 2$ points, which means that the VC dimension is $d + 1$.
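Shattering can be checked by brute force in the simplest case d = 1, where half-spaces reduce to half-lines. The sketch below is an illustration under that assumption: it enumerates the labelings produced by classifiers of the form x -> s * sgn(x - theta) with s in {-1, +1}, and the function names are hypothetical.

```python
def half_line_labelings(points):
    """All labelings of distinct `points` by x -> s * sgn(x - theta), s in {-1, +1}.

    Only thresholds below, between, and above the sorted points can give
    distinct labelings, so these candidates are sufficient.
    """
    xs = sorted(points)
    cuts = [xs[0] - 1.0]
    cuts += [(a + b) / 2.0 for a, b in zip(xs, xs[1:])]
    cuts += [xs[-1] + 1.0]
    labelings = set()
    for theta in cuts:
        for s in (-1, 1):
            labelings.add(tuple(s if x >= theta else -s for x in points))
    return labelings


def is_shattered(points):
    """True if all 2^n labelings of the n points are realized."""
    return len(half_line_labelings(points)) == 2 ** len(points)


print(is_shattered([0.0, 1.0]))
print(is_shattered([0.0, 1.0, 2.0]))
```

Two points are shattered, but for three points the labelings that separate the middle point from both neighbors are missing, in agreement with a VC dimension of d + 1 = 2. Any three distinct points on the line behave the same way, since only their order matters.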




Fig. 3. Computing the VC dimension of hyperplanes in dimension 2: a set of 3 points 
can be shattered, but no set of four points 



It is interesting to notice that the number of parameters needed to define
half-spaces in $\mathbb{R}^d$ is $d$, so that a natural question is whether the VC
dimension is related to the number of parameters of the function class. The next
example, depicted in Figure 4, is a family of functions with one parameter only:

$$\{\mathrm{sgn}(\sin(tx)) : t \in \mathbb{R}\}$$

which actually has infinite VC dimension (this is an exercise left to the reader).

It remains to show how the notion of VC dimension can bring a solution
to the problem of computing the growth function. Indeed, at first glance, if we
know that a class has VC dimension $h$, it entails that for all $n \le h$,
$S_{\mathcal{G}}(n) = 2^n$ and $S_{\mathcal{G}}(n) < 2^n$ otherwise. This seems of
little use, but actually, an intriguing phenomenon occurs for $n \ge h$ as depicted
in Figure 5. The growth function, which is exponential (its logarithm is linear)
up until the VC dimension, becomes polynomial afterwards.

Fig. 4. VC dimension of sinusoids

Fig. 5. Typical behavior of the log growth function

This behavior is captured in the following lemma. 

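Lemma 1 (Vapnik and Chervonenkis, Sauer, Shelah). Let $\mathcal{G}$ be a class of
functions with finite VC dimension $h$. Then for all $n \in \mathbb{N}$,

$$S_{\mathcal{G}}(n) \le \sum_{i=0}^{h} \binom{n}{i} ,$$

and for all $n \ge h$,

$$S_{\mathcal{G}}(n) \le \left(\frac{en}{h}\right)^h .$$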


Using this lemma along with Theorem 2 we immediately obtain that if $\mathcal{G}$
has VC dimension $h$, with probability at least $1 - \delta$,

$$\forall g \in \mathcal{G}, \quad R(g) \le R_n(g) + 2 \sqrt{2\, \frac{h \log\frac{2en}{h} + \log\frac{4}{\delta}}{n}} .$$

What is important to recall from this result, is that the difference between
the true and empirical risk is at most of order

$$\sqrt{\frac{h \log n}{n}} .$$


An interpretation of VC dimension and growth functions is that they measure
the effective size of the class, that is the size of the projection of the class
onto finite samples. In addition, this measure does not just 'count' the number
of functions in the class but depends on the geometry of the class (rather its
projections). Finally, the finiteness of the VC dimension ensures that the
empirical risk will converge uniformly over the class to the true risk.
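The change from exponential to polynomial growth, and the resulting deviation term, can be checked numerically. In this sketch the function names are illustrative, and the constants inside `vc_radius` (in particular the log(4/delta) term) are those obtained by inverting a tail bound of the form 4 S(2n) exp(-n t^2 / 8); they should be read as an assumption rather than as the sharpest possible values.

```python
import math


def sauer_sum(n, h):
    """Sum of binomial coefficients C(n, i) for i = 0..h (first bound of Lemma 1)."""
    return sum(math.comb(n, i) for i in range(h + 1))


def sauer_poly(n, h):
    """(e n / h) ** h, which bounds the growth function for n >= h."""
    return (math.e * n / h) ** h


def vc_radius(n, h, delta):
    """2 * sqrt(2 * (h * log(2 e n / h) + log(4 / delta)) / n), assumed constants."""
    inner = h * math.log(2.0 * math.e * n / h) + math.log(4.0 / delta)
    return 2.0 * math.sqrt(2.0 * inner / n)


h = 3
for n in (2, 3, 10, 100):
    print(n, 2 ** n, sauer_sum(n, h))
print(vc_radius(10 ** 4, 10, 0.05), vc_radius(10 ** 6, 10, 0.05))
```

For n up to h the sum equals 2^n, and beyond h it grows only like a polynomial of degree h, which is what makes the deviation term vanish as n increases.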



4.4 Symmetrization 

We now indicate how to prove Theorem 2. The key ingredient to the proof is the
so-called symmetrization lemma. The idea is to replace the true risk by an
estimate computed on an independent set of data. This is of course a mathematical
technique and does not mean one needs to have more data to be able to apply
the result. The extra data set is usually called 'virtual' or 'ghost sample'.

We will denote by $Z'_1, \ldots, Z'_n$ an independent (ghost) sample and by $P'_n$
the corresponding empirical measure.

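Lemma 2 (Symmetrization). For any $t > 0$, such that $nt^2 \ge 2$,

$$P\left[\sup_{f \in \mathcal{F}} (P - P_n) f \ge t\right] \le 2 P\left[\sup_{f \in \mathcal{F}} (P'_n - P_n) f \ge t/2\right] .$$

Proof. Let $f_n$ be the function achieving the supremum (note that it depends
on $Z_1, \ldots, Z_n$). One has (with $\wedge$ denoting the conjunction of two
events),

$$1_{(P - P_n) f_n \ge t}\, 1_{(P - P'_n) f_n < t/2}
  = 1_{(P - P_n) f_n \ge t \,\wedge\, (P'_n - P) f_n > -t/2}
  \le 1_{(P'_n - P_n) f_n \ge t/2} .$$

Taking expectations with respect to the second sample gives

$$1_{(P - P_n) f_n \ge t}\, P'[(P - P'_n) f_n < t/2] \le P'[(P'_n - P_n) f_n \ge t/2] .$$

By Chebyshev's inequality (see Appendix A),

$$P'[(P - P'_n) f_n \ge t/2] \le \frac{4\, \mathrm{Var} f_n}{n t^2} \le \frac{1}{n t^2} .$$

Indeed, a random variable with range in $[0, 1]$ has variance less than 1/4.
Hence

$$1_{(P - P_n) f_n \ge t} \left(1 - \frac{1}{n t^2}\right) \le P'[(P'_n - P_n) f_n \ge t/2] .$$

Taking expectation with respect to the first sample, and using
$1 - 1/(nt^2) \ge 1/2$ when $nt^2 \ge 2$, gives the result. □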



This lemma allows to replace the expectation $Pf$ by an empirical average
over the ghost sample. As a result, the right hand side only depends on the
projection of the class $\mathcal{F}$ on the double sample:

$$\mathcal{F}_{Z_1, \ldots, Z_n, Z'_1, \ldots, Z'_n} ,$$

which contains finitely many different vectors. One can thus use the simple union
bound that was presented before in the finite case. The other ingredient that is
needed to obtain Theorem 2 is again Hoeffding's inequality in the following form:

$$P[P_n f - P'_n f > t] \le 2 e^{-nt^2/2} .$$

We now just have to put the pieces together:

$$P\left[\sup_{f \in \mathcal{F}} (P - P_n) f \ge t\right]$$
$$\le 2 P\left[\sup_{f \in \mathcal{F}} (P'_n - P_n) f \ge t/2\right]$$
$$= 2 P\left[\sup_{f \in \mathcal{F}_{Z_1, \ldots, Z_n, Z'_1, \ldots, Z'_n}} (P'_n - P_n) f \ge t/2\right]$$
$$\le 2 S_{\mathcal{F}}(2n)\, P[(P'_n - P_n) f \ge t/2]$$
$$\le 4 S_{\mathcal{F}}(2n)\, e^{-nt^2/8} .$$

Using inversion finishes the proof of Theorem 2.

4.5 VC Entropy 

One important aspect of the VC dimension is that it is distribution independent.
Hence, it allows to get bounds that do not depend on the problem at hand:
the same bound holds for any distribution. Although this may be seen as an
advantage, it can also be a drawback since, as a result, the bound may be loose
for most distributions.

We now show how to modify the proof above to get a distribution-dependent
result. We use the following notation
$N(\mathcal{F}, z_1^n) := |\mathcal{F}_{z_1, \ldots, z_n}|$.

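Definition 3 (VC Entropy). The (annealed) VC entropy is defined as

$$H_{\mathcal{F}}(n) = \log E[N(\mathcal{F}, Z_1^n)] .$$

Theorem 3. For any $\delta > 0$, with probability at least $1 - \delta$,

$$\forall g \in \mathcal{G}, \quad R(g) \le R_n(g) + 2 \sqrt{2\, \frac{H_{\mathcal{G}}(2n) + \log\frac{2}{\delta}}{n}} .$$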




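Proof. We again begin with the symmetrization lemma so that we have to
upper bound the quantity

$$I = P\left[\sup_{f \in \mathcal{F}_{Z_1^n, Z'^n_1}} (P'_n - P_n) f \ge t/2\right] .$$

Let $\sigma_1, \ldots, \sigma_n$ be $n$ independent random variables such that
$P(\sigma_i = 1) = P(\sigma_i = -1) = 1/2$ (they are called Rademacher variables).
We notice that the quantities $(P'_n - P_n) f$ and
$\frac{1}{n} \sum_{i=1}^{n} \sigma_i (f(Z'_i) - f(Z_i))$ have the same distribution
since changing one $\sigma_i$ corresponds to exchanging $Z_i$ and $Z'_i$. Hence we
have

$$I \le E\left[P_\sigma\left[\sup_{f \in \mathcal{F}_{Z_1^n, Z'^n_1}} \frac{1}{n} \sum_{i=1}^{n} \sigma_i (f(Z'_i) - f(Z_i)) \ge t/2\right]\right]$$

and the union bound leads to

$$I \le E\left[N(\mathcal{F}, Z_1^n, Z'^n_1)\, \max_{f} P_\sigma\left[\frac{1}{n} \sum_{i=1}^{n} \sigma_i (f(Z'_i) - f(Z_i)) \ge t/2\right]\right] .$$

Since $\sigma_i (f(Z'_i) - f(Z_i)) \in [-1, 1]$, Hoeffding's inequality finally
gives

$$I \le E[N(\mathcal{F}, Z_1^n, Z'^n_1)]\, e^{-nt^2/8} .$$

The rest of the proof is as before. □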



5 Capacity Measures 

We have seen so far three measures of capacity or size of classes of functions:
the VC dimension and growth function, both distribution independent, and the VC
entropy which depends on the distribution. Apart from the VC dimension, they
are usually hard or impossible to compute. There are however other measures
which not only may give sharper estimates, but also have properties that make
their computation possible from the data only.

5.1 Covering Numbers 

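We start by endowing the function class $\mathcal{F}$ with the following (random)
metric

$$d_n(f, f') = \frac{1}{n} |\{i = 1, \ldots, n : f(Z_i) \neq f'(Z_i)\}| .$$

This is the normalized Hamming distance of the 'projections' on the sample.
Given such a metric, we say that a set $f_1, \ldots, f_N$ covers $\mathcal{F}$ at
radius $\varepsilon$ if

$$\mathcal{F} \subset \bigcup_{i=1}^{N} B(f_i, \varepsilon) ,$$

where $B(f_i, \varepsilon)$ is the ball of radius $\varepsilon$ centered at $f_i$
for the metric $d_n$. We then define the covering numbers of $\mathcal{F}$ as
follows.

Definition 4 (Covering Number). The covering number of $\mathcal{F}$ at radius
$\varepsilon$, with respect to $d_n$, denoted by $N(\mathcal{F}, \varepsilon, n)$,
is the minimum size of a cover of radius $\varepsilon$.

Notice that it does not matter if we apply this definition to the original class
$\mathcal{G}$ or the loss class $\mathcal{F}$, since
$N(\mathcal{F}, \varepsilon, n) = N(\mathcal{G}, \varepsilon, n)$.

The covering numbers characterize the size of a function class as measured by
the metric $d_n$. The rate of growth of the logarithm of
$N(\mathcal{G}, \varepsilon, n)$, usually called the metric entropy, is related to
the classical concept of vector dimension. Indeed, if $\mathcal{G}$ is a compact
set in a $d$-dimensional Euclidean space,
$N(\mathcal{G}, \varepsilon, n) \approx \varepsilon^{-d}$.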
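A cover under the normalized Hamming distance can be built greedily on a finite projected class. This sketch assumes closed balls (distance at most eps) and produces a valid cover whose size only upper-bounds the covering number; the function names and the example class (all binary vectors of length 4) are illustrative assumptions.

```python
from itertools import product


def hamming_distance(u, v):
    """Normalized Hamming distance d_n between two projected functions."""
    return sum(1 for a, b in zip(u, v) if a != b) / len(u)


def greedy_cover(vectors, eps):
    """Pick centers greedily until every vector is within eps of some center.

    The result is a valid cover at radius eps, so its size is an upper bound
    on the covering number; it need not be a minimum-size cover.
    """
    centers = []
    for vec in vectors:
        if all(hamming_distance(vec, c) > eps for c in centers):
            centers.append(vec)
    return centers


vectors = list(product((0, 1), repeat=4))  # all 16 projections on 4 points
for eps in (0.0, 0.25, 0.5, 1.0):
    print(eps, len(greedy_cover(vectors, eps)))
```

At radius zero every distinct projection needs its own center, and the number of centers shrinks as the radius grows.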
When the covering numbers are finite, it is possible to approximate the class
$\mathcal{G}$ by a finite set of functions (which cover $\mathcal{G}$). This again
allows to use the finite union bound, provided we can relate the behavior of all
functions in $\mathcal{G}$ to that of functions in the cover. A typical result,
which we provide without proof, is the following.

Theorem 4. For any $t > 0$,

$$P[\exists g \in \mathcal{G} : R(g) > R_n(g) + t] \le 8 E[N(\mathcal{G}, t, n)]\, e^{-nt^2/128} .$$

Covering numbers can also be defined for classes of real-valued functions.

We now relate the covering numbers to the VC dimension. Notice that, because
the functions in $\mathcal{G}$ can only take two values, for all
$\varepsilon > 0$, $N(\mathcal{G}, \varepsilon, n) \le |\mathcal{G}_{Z_1^n}| = N(\mathcal{G}, Z_1^n)$.
Hence the VC entropy corresponds to log covering numbers at minimal scale, which
implies $\log N(\mathcal{G}, \varepsilon, n) \le h \log\frac{en}{h}$, but one can
have a considerably better result.

Lemma 3 (Haussler). Let $\mathcal{G}$ be a class of VC dimension $h$. Then, for all
$\varepsilon > 0$, all $n$, and any sample,

$$N(\mathcal{G}, \varepsilon, n) \le C h (4e)^h \varepsilon^{-h} .$$

The interest of this result is that the upper bound does not depend on the
sample size $n$.

The covering number bound is a generalization of the VC entropy bound
where the scale is adapted to the error. It turns out that this result can be
improved by considering all scales (see Section 5.2).

5.2 Rademacher Averages 

Recall that we used in the proof of Theorem 3 Rademacher random variables,
i.e. independent $\{-1, 1\}$-valued random variables with probability 1/2 of taking
either value.

For convenience we introduce the following notation (signed empirical measure)
$R_n f = \frac{1}{n} \sum_{i=1}^{n} \sigma_i f(Z_i)$. We will denote by $E_\sigma$
the expectation taken with respect to the Rademacher variables (i.e. conditionally
to the data) while $E$ will denote the expectation with respect to all the random
variables (i.e. the data, the ghost sample and the Rademacher variables).

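Definition 5 (Rademacher Averages). For a class $\mathcal{F}$ of functions, the Rademacher average is defined as

$$\mathcal{R}(\mathcal{F}) = \mathbb{E} \sup_{f \in \mathcal{F}} R_n f ,$$

and the conditional Rademacher average is defined as

$$\mathcal{R}_n(\mathcal{F}) = \mathbb{E}_\sigma \sup_{f \in \mathcal{F}} R_n f .$$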



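We now state the fundamental result involving Rademacher averages.

Theorem 5. For all $\delta > 0$, with probability at least $1-\delta$,

$$\forall f \in \mathcal{F},\quad Pf \le P_n f + 2\mathcal{R}(\mathcal{F}) + \sqrt{\frac{\log\frac{1}{\delta}}{2n}} ,$$

and also, with probability at least $1-\delta$,

$$\forall f \in \mathcal{F},\quad Pf \le P_n f + 2\mathcal{R}_n(\mathcal{F}) + \sqrt{\frac{2\log\frac{2}{\delta}}{n}} .$$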

It is remarkable that one can obtain a bound (second part of the theorem) 
which depends solely on the data. 

The proof of the above result requires a powerful tool called a concentration 
inequality for empirical processes. 

Actually, Hoeffding’s inequality is a (simple) concentration inequality, in the 
sense that when n increases, the empirical average is concentrated around the 
expectation. It is possible to generalize this result to functions that depend on 
i.i.d. random variables as shown in the theorem below. 

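Theorem 6 (McDiarmid [17]). Assume that for all $i = 1, \ldots, n$,

$$\sup_{z_1,\ldots,z_n,\,z_i'} \left|F(z_1,\ldots,z_i,\ldots,z_n) - F(z_1,\ldots,z_i',\ldots,z_n)\right| \le c ;$$

then for all $\varepsilon > 0$,

$$\mathbb{P}\left[\,|F - \mathbb{E}[F]| > \varepsilon\,\right] \le 2\exp\left(-\frac{2\varepsilon^2}{nc^2}\right) .$$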

The meaning of this result is thus that, as soon as one has a function of n 
independent random variables, which is such that its variation is bounded when 
one variable is modified, the function will satisfy a Hoeffding-like inequality. 

Proof of Theorem 5. To prove Theorem 5, we follow three steps:

1. Use concentration to relate $\sup_{f\in\mathcal{F}} (Pf - P_n f)$ to its expectation,

2. use symmetrization to relate the expectation to the Rademacher average,

3. use concentration again to relate the Rademacher average to the conditional
one.

We first show that McDiarmid's inequality can be applied to $\sup_{f\in\mathcal{F}} (Pf - P_n f)$.
We denote temporarily by $P_n^i$ the empirical measure obtained by modifying
one element (e.g. $Z_i$ is replaced by $Z_i'$) of the sample. It is easy to check that
the following holds:

$$\left|\sup_{f\in\mathcal{F}}(Pf - P_n f) - \sup_{f\in\mathcal{F}}(Pf - P_n^i f)\right| \le \sup_{f\in\mathcal{F}} \left|P_n^i f - P_n f\right| .$$

Since $f \in \{0, 1\}$ we obtain

$$\left|P_n^i f - P_n f\right| = \frac{1}{n}\left|f(Z_i') - f(Z_i)\right| \le \frac{1}{n} ,$$

and thus McDiarmid's inequality can be applied with $c = 1/n$. This concludes
the first step of the proof.
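As a worked step, here is how the bounded-difference constant $c = 1/n$ turns into an explicit deviation term. This is a sketch that assumes the one-sided form of McDiarmid's inequality (the bound of Theorem 6 without the factor 2); the symbol $\Phi$ is a shorthand introduced here and is not part of the original notation.

```latex
% Phi: shorthand (introduced here) for the supremum studied in step 1.
\Phi = \sup_{f \in \mathcal{F}} (Pf - P_n f), \qquad c = \frac{1}{n}.

% One-sided McDiarmid with c = 1/n:
\mathbb{P}\left[\Phi - \mathbb{E}\,\Phi > \varepsilon\right]
  \le \exp\left(-\frac{2\varepsilon^2}{n c^2}\right)
  = \exp\left(-2 n \varepsilon^2\right).

% Set the right-hand side equal to delta and solve for epsilon:
\exp(-2 n \varepsilon^2) = \delta
  \iff \varepsilon = \sqrt{\frac{\log\frac{1}{\delta}}{2n}}.

% Hence, with probability at least 1 - delta,
\sup_{f \in \mathcal{F}} (Pf - P_n f)
  \le \mathbb{E} \sup_{f \in \mathcal{F}} (Pf - P_n f)
      + \sqrt{\frac{\log\frac{1}{\delta}}{2n}}.
```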

We next prove the (first part of the) following symmetrization lemma. 

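Lemma 4. For any class $\mathcal{F}$,

$$\mathbb{E} \sup_{f\in\mathcal{F}} (Pf - P_n f) \le 2\,\mathbb{E} \sup_{f\in\mathcal{F}} R_n f ,$$

and

$$\mathbb{E} \sup_{f\in\mathcal{F}} |Pf - P_n f| \ge \frac{1}{2}\,\mathbb{E} \sup_{f\in\mathcal{F}} R_n f - \frac{1}{2\sqrt{n}} .$$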

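Proof. We only prove the first part. We introduce a ghost sample and its
corresponding measure $P_n'$. We successively use the fact that $\mathbb{E}P_n' f = Pf$ and that
the supremum is a convex function (hence we can apply Jensen's inequality, see
Appendix A):

$$\begin{aligned}
\mathbb{E} \sup_{f\in\mathcal{F}} (Pf - P_n f)
&= \mathbb{E} \sup_{f\in\mathcal{F}} \left(\mathbb{E}[P_n' f] - P_n f\right) \\
&\le \mathbb{E} \sup_{f\in\mathcal{F}} (P_n' f - P_n f) \\
&= \mathbb{E}_\sigma \mathbb{E} \sup_{f\in\mathcal{F}} \frac{1}{n}\sum_{i=1}^n \sigma_i \left(f(Z_i') - f(Z_i)\right) \\
&\le \mathbb{E}_\sigma \mathbb{E} \sup_{f\in\mathcal{F}} \frac{1}{n}\sum_{i=1}^n \sigma_i f(Z_i') + \mathbb{E}_\sigma \mathbb{E} \sup_{f\in\mathcal{F}} \frac{1}{n}\sum_{i=1}^n -\sigma_i f(Z_i) \\
&= 2\,\mathbb{E} \sup_{f\in\mathcal{F}} R_n f ,
\end{aligned}$$

where the third step uses the fact that $f(Z_i') - f(Z_i)$ and $\sigma_i(f(Z_i') - f(Z_i))$
have the same distribution and the last step uses the fact that $\sigma_i f(Z_i')$ and
$-\sigma_i f(Z_i)$ have the same distribution. □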

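The above already establishes the first part of Theorem 5. For the second part,
we need to use concentration again. For this we apply McDiarmid's inequality
to the following functional:

$$F(Z_1,\ldots,Z_n) = \mathcal{R}_n(\mathcal{F}) .$$

It is easy to check that $F$ satisfies McDiarmid's assumptions with $c = \frac{1}{n}$. As
a result, $\mathbb{E}F = \mathcal{R}(\mathcal{F})$ can be sharply estimated by $F = \mathcal{R}_n(\mathcal{F})$.

Loss Class and Initial Class. In order to make use of Theorem 5 we have to
relate the Rademacher average of the loss class to that of the initial class. This
can be done with the following derivation, where one uses the fact that $\sigma_i$ and
$\sigma_i Y_i$ have the same distribution.

$$\begin{aligned}
\mathcal{R}(\mathcal{F}) &= \mathbb{E}\left[\sup_{g\in\mathcal{G}} \frac{1}{n}\sum_{i=1}^n \sigma_i \mathbb{1}_{g(X_i)\ne Y_i}\right] \\
&= \mathbb{E}\left[\sup_{g\in\mathcal{G}} \frac{1}{n}\sum_{i=1}^n \sigma_i \frac{1}{2}\left(1 - Y_i g(X_i)\right)\right] \\
&= \frac{1}{2}\,\mathbb{E}\left[\sup_{g\in\mathcal{G}} \frac{1}{n}\sum_{i=1}^n \sigma_i Y_i g(X_i)\right] = \frac{1}{2}\mathcal{R}(\mathcal{G}) .
\end{aligned}$$

Notice that the same is valid for conditional Rademacher averages, so that
we obtain that with probability at least $1-\delta$,

$$\forall g \in \mathcal{G},\quad R(g) \le R_n(g) + \mathcal{R}_n(\mathcal{G}) + \sqrt{\frac{2\log\frac{2}{\delta}}{n}} .$$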




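Computing the Rademacher Averages. We now assess the difficulty of
actually computing the Rademacher averages. We write the following.

$$\begin{aligned}
\frac{1}{2}\,\mathbb{E}_\sigma\left[\sup_{g\in\mathcal{G}} \frac{1}{n}\sum_{i=1}^n \sigma_i g(X_i)\right]
&= \frac{1}{2} + \mathbb{E}_\sigma\left[\sup_{g\in\mathcal{G}} \frac{1}{n}\sum_{i=1}^n -\frac{1-\sigma_i g(X_i)}{2}\right] \\
&= \frac{1}{2} - \mathbb{E}_\sigma\left[\inf_{g\in\mathcal{G}} \frac{1}{n}\sum_{i=1}^n \frac{1-\sigma_i g(X_i)}{2}\right] \\
&= \frac{1}{2} - \mathbb{E}_\sigma\left[\inf_{g\in\mathcal{G}} R_n(g,\sigma)\right] ,
\end{aligned}$$

where $R_n(g,\sigma)$ denotes the empirical error of $g$ with respect to the labels $\sigma_i$.

This indicates that, given a sample and a choice of the random variables
$\sigma_1,\ldots,\sigma_n$, computing $\mathcal{R}_n(\mathcal{G})$ is not harder than computing the empirical risk
minimizer in $\mathcal{G}$. Indeed, the procedure would be to generate the $\sigma_i$ randomly
and minimize the empirical error in $\mathcal{G}$ with respect to the labels $\sigma_i$.

An advantage of rewriting $\mathcal{R}_n(\mathcal{G})$ as above is that it gives an intuition of what
it actually measures: it measures how much the class $\mathcal{G}$ can fit random noise. If
the class $\mathcal{G}$ is very large, there will always be a function which can perfectly fit
the $\sigma_i$, and then $\mathcal{R}_n(\mathcal{G})/2 = 1/2$, so that there is no hope of uniform convergence
to zero of the difference between true and empirical risks.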
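The procedure just described (draw random signs, then run empirical risk minimization against them) can be sketched in a few lines. The example below is a hypothetical illustration, not the authors' implementation: the class is taken to be threshold classifiers on the line together with their negations, small enough that ERM can be done by exhaustive search, and the expectation over the signs is approximated by Monte Carlo. All function names are made up for this sketch.

```python
import random


def threshold_class(xs):
    """Distinct +/-1 label vectors produced on xs by the classifiers
    g_theta(x) = +1 if x >= theta else -1, together with their negations."""
    cuts = sorted(xs) + [float("inf")]
    vectors = set()
    for theta in cuts:
        labels = tuple(1 if x >= theta else -1 for x in xs)
        vectors.add(labels)
        vectors.add(tuple(-v for v in labels))
    return sorted(vectors)


def empirical_error(pred, labels):
    """Fraction of sample points where the prediction disagrees with the label."""
    return sum(p != y for p, y in zip(pred, labels)) / len(labels)


def rademacher_via_erm(vectors, draws, rng):
    """Monte Carlo estimates of E_sigma sup_g (1/n) sum sigma_i g(X_i)
    and of E_sigma inf_g R_n(g, sigma), using the same sign draws for both."""
    n = len(vectors[0])
    total_sup = 0.0
    total_min_err = 0.0
    for _ in range(draws):
        sigma = [rng.choice((-1, 1)) for _ in range(n)]
        total_sup += max(sum(s * v for s, v in zip(sigma, vec)) / n for vec in vectors)
        # ERM on the random labels sigma, by exhaustive search over the class.
        total_min_err += min(empirical_error(vec, sigma) for vec in vectors)
    return total_sup / draws, total_min_err / draws


rng = random.Random(0)
sample = [rng.random() for _ in range(10)]
rad, min_err = rademacher_via_erm(threshold_class(sample), draws=500, rng=rng)
print(f"estimated R_n(G) = {rad:.3f}, 1 - 2 * E inf R_n(g, sigma) = {1 - 2 * min_err:.3f}")
```

Because the identity holds for every draw of the signs, the two printed numbers agree up to rounding; the Monte Carlo error only affects how close both are to the true conditional Rademacher average.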

For a finite set with $|\mathcal{G}| = N$, one can show that

$$\mathcal{R}_n(\mathcal{G}) \le 2\sqrt{\frac{\log N}{n}} ,$$

where we again see the logarithmic factor $\log N$. A consequence of this is that,
by considering the projection on the sample of a class $\mathcal{G}$ with VC dimension $h$,
and using Lemma 1, we have

$$\mathcal{R}(\mathcal{G}) \le 2\sqrt{\frac{h \log\frac{en}{h}}{n}} .$$

This result, along with Theorem 5, allows us to recover the Vapnik-Chervonenkis
bound with a concentration-based proof.
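For very small samples the conditional Rademacher average of a finite class can be computed exactly by enumerating all sign vectors, which gives a direct check of the finite-class bound $2\sqrt{\log N / n}$. The sketch below assumes a hypothetical class of $N = n + 1$ threshold label vectors on $n = 12$ ordered points; the enumeration costs $2^n$ evaluations, so it is only meant as a sanity check.

```python
import itertools
import math


def threshold_vectors(n):
    """The n + 1 distinct +/-1 label vectors of threshold classifiers on n ordered points."""
    return [tuple(1 if i >= k else -1 for i in range(n)) for k in range(n + 1)]


def exact_conditional_rademacher(vectors):
    """Average of sup_g (1/n) sum sigma_i g_i over all 2^n sign vectors sigma."""
    n = len(vectors[0])
    total = 0.0
    for sigma in itertools.product((-1, 1), repeat=n):
        total += max(sum(s * v for s, v in zip(sigma, vec)) for vec in vectors) / n
    return total / 2 ** n


def finite_class_bound(num_functions, n):
    """The bound 2 * sqrt(log N / n) for a class of N functions."""
    return 2.0 * math.sqrt(math.log(num_functions) / n)


n = 12
vectors = threshold_vectors(n)
value = exact_conditional_rademacher(vectors)
bound = finite_class_bound(len(vectors), n)
print(f"exact R_n = {value:.4f}, bound = {bound:.4f}")
```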

Although the benefit of using concentration may not be entirely clear at this
point, let us just mention that one can actually improve the dependence on $n$
of the above bound. This is based on the so-called chaining technique. The idea
is to use covering numbers at all scales in order to capture the geometry of the
class in a better way than the VC entropy does.

One has the following result, called Dudley's entropy bound:

$$\mathcal{R}_n(\mathcal{F}) \le \frac{C}{\sqrt{n}} \int_0^\infty \sqrt{\log N(\mathcal{F},t,n)}\; dt .$$

As a consequence, along with Haussler's upper bound, we can get the following
result:

$$\mathcal{R}_n(\mathcal{F}) \le C\sqrt{\frac{h}{n}} .$$

We can thus, with this approach, remove the unnecessary $\log n$ factor of the
VC bound.



6 Advanced Topics 

In this section, we point out several ways in which the results presented so far 
can be improved. The main source of improvement actually comes, as mentioned 
earlier, from the fact that Hoeffding and McDiarmid inequalities do not make 
use of the variance of the functions. 

6.1 Binomial Tails 

We recall that the functions we consider are binary valued. So, if we consider a
fixed function $f$, the distribution of $P_n f$ is actually a binomial law of parameters
$Pf$ and $n$ (since we are summing $n$ i.i.d. random variables $f(Z_i)$ which can either
be 0 or 1 and are equal to 1 with probability $\mathbb{E} f(Z_i) = Pf$). Denoting $p = Pf$,
we can have an exact expression for the deviations of $P_n f$ from $Pf$:

$$\mathbb{P}\left[Pf - P_n f \ge t\right] = \sum_{k=0}^{\lfloor n(p-t) \rfloor} \binom{n}{k} p^k (1-p)^{n-k} .$$

Since this expression is not easy to manipulate, we have used an upper bound
provided by Hoeffding's inequality. However, there exist other (sharper) upper
bounds. The following quantities are upper bounds on $\mathbb{P}[Pf - P_n f \ge t]$:

$$\begin{aligned}
&\left(\frac{1-p}{1-p-t}\right)^{n(1-p-t)} \left(\frac{p}{p+t}\right)^{n(p+t)} && \text{(exponential)} \\
&e^{-\frac{np}{1-p}\left((1-t/p)\log(1-t/p)+t/p\right)} && \text{(Bennett)} \\
&e^{-\frac{nt^2}{2p(1-p)+2t/3}} && \text{(Bernstein)} \\
&e^{-2nt^2} && \text{(Hoeffding)}
\end{aligned}$$

Examining the above bounds (and using inversion), we can say that roughly
speaking, the small deviations of $Pf - P_n f$ have a Gaussian behavior of the
form $\exp(-nt^2/2p(1-p))$ (i.e. Gaussian with variance $p(1-p)$) while the large
deviations have a Poisson behavior of the form $\exp(-3nt/2)$.

So the tails are heavier than Gaussian, and Hoeffding's inequality consists in
upper bounding the tails with a Gaussian with maximum variance, hence the
term $\exp(-2nt^2)$.
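To see how much is lost by ignoring the variance, one can compare the exact binomial tail with the Bernstein and Hoeffding bounds for a function with small expectation. The following sketch uses illustrative values ($n = 1000$, $p = 0.05$, $t = 0.03$) chosen here for the example; the function names are hypothetical.

```python
import math


def exact_lower_tail(n, p, t):
    """P[Pf - P_n f >= t] when n * P_n f follows a Binomial(n, p) law."""
    k_max = math.floor(n * (p - t) + 1e-9)  # small slack against rounding
    return sum(math.comb(n, k) * p ** k * (1 - p) ** (n - k) for k in range(k_max + 1))


def hoeffding_bound(n, t):
    """exp(-2 n t^2): ignores the variance p(1 - p)."""
    return math.exp(-2 * n * t * t)


def bernstein_bound(n, p, t):
    """exp(-n t^2 / (2 p (1 - p) + 2 t / 3)): uses the variance p(1 - p)."""
    return math.exp(-n * t * t / (2 * p * (1 - p) + 2 * t / 3))


n, p, t = 1000, 0.05, 0.03
exact = exact_lower_tail(n, p, t)
bern = bernstein_bound(n, p, t)
hoef = hoeffding_bound(n, t)
print(f"exact = {exact:.2e}, Bernstein = {bern:.2e}, Hoeffding = {hoef:.2e}")
```

For small $p$ the Bernstein bound is several orders of magnitude closer to the exact tail than the Hoeffding bound, which is the gap the following subsections exploit.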

Each function $f \in \mathcal{F}$ has a different variance $Pf(1 - Pf) \le Pf$. Moreover,
for each $f \in \mathcal{F}$, by Bernstein's inequality, with probability at least $1-\delta$,

$$Pf \le P_n f + \sqrt{\frac{2\,Pf \log\frac{1}{\delta}}{n}} + \frac{2\log\frac{1}{\delta}}{3n} .$$

The Gaussian part (second term in the right hand side) dominates (for $Pf$
not too small, or $n$ large enough), and it depends on $Pf$. We thus want to
combine Bernstein's inequality with the union bound and the symmetrization.
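The step from the Bernstein tail bound to a bound holding with probability at least $1-\delta$ is an instance of inversion. A short worked derivation follows; the shorthands $v$ and $L$ are introduced here for readability only.

```latex
% Shorthand introduced here: v = p(1-p) is the variance, L = log(1/delta).
\exp\left(-\frac{n t^2}{2v + 2t/3}\right) = \delta
  \iff n t^2 - \frac{2L}{3}\, t - 2 v L = 0 .

% Positive root of the quadratic in t, then sqrt(a + b) <= sqrt(a) + sqrt(b):
t = \frac{L}{3n} + \sqrt{\frac{L^2}{9 n^2} + \frac{2 v L}{n}}
  \le \sqrt{\frac{2 v L}{n}} + \frac{2L}{3n} .

% Since v = p(1-p) <= p = Pf, with probability at least 1 - delta,
Pf - P_n f \le \sqrt{\frac{2\, Pf \log\frac{1}{\delta}}{n}}
  + \frac{2 \log\frac{1}{\delta}}{3n} .
```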



6.2 Normalization 

The idea is to consider the ratio

$$\frac{Pf - P_n f}{\sqrt{Pf}} .$$

Here ($f \in \{0, 1\}$), $\operatorname{Var} f \le Pf^2 = Pf$.

The reason for considering this ratio is that after normalization, fluctuations
are more 'uniform' in the class $\mathcal{F}$. Hence the supremum in

$$\sup_{f\in\mathcal{F}} \frac{Pf - P_n f}{\sqrt{Pf}}$$

is not necessarily attained at functions with large variance, as was the case
previously.

Moreover, we know that our goal is to find functions with small error $Pf$
(hence small variance). The normalized supremum takes this into account.

We now state a result similar to Theorem 2 for the normalized supremum.







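Theorem 7 (Vapnik-Chervonenkis, [18]). For $\delta > 0$, with probability at least $1-\delta$,

$$\forall f\in\mathcal{F},\quad \frac{Pf - P_n f}{\sqrt{Pf}} \le 2\sqrt{\frac{\log S_{\mathcal{F}}(2n) + \log\frac{4}{\delta}}{n}} ,$$

and also, with probability at least $1-\delta$,

$$\forall f\in\mathcal{F},\quad \frac{P_n f - Pf}{\sqrt{P_n f}} \le 2\sqrt{\frac{\log S_{\mathcal{F}}(2n) + \log\frac{4}{\delta}}{n}} .$$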



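Proof. We only give a sketch of the proof. The first step is a variation of the
symmetrization lemma:

$$\mathbb{P}\left[\sup_{f\in\mathcal{F}} \frac{Pf - P_n f}{\sqrt{Pf}} \ge t\right] \le 2\,\mathbb{P}\left[\sup_{f\in\mathcal{F}} \frac{P_n' f - P_n f}{\sqrt{(P_n f + P_n' f)/2}} \ge t\right] .$$

The second step consists in randomization (with Rademacher variables):

$$\cdots = 2\,\mathbb{E}\left[\mathbb{P}_\sigma\left[\sup_{f\in\mathcal{F}} \frac{\frac{1}{n}\sum_{i=1}^n \sigma_i \left(f(Z_i') - f(Z_i)\right)}{\sqrt{(P_n f + P_n' f)/2}} \ge t\right]\right] .$$

Finally, one uses a tail bound of Bernstein type. □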



Let us explore the consequences of this result. 

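From the fact that for non-negative numbers $A$, $B$, $C$,

$$A \le B + C\sqrt{A} \;\Rightarrow\; A \le B + C^2 + \sqrt{B}\,C ,$$

we easily get for example

$$\forall f\in\mathcal{F},\quad Pf \le P_n f + 2\sqrt{P_n f\,\frac{\log S_{\mathcal{F}}(2n) + \log\frac{4}{\delta}}{n}} + 4\,\frac{\log S_{\mathcal{F}}(2n) + \log\frac{4}{\delta}}{n} .$$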


In the ideal situation where there is no noise (i.e. $Y = t(X)$ almost surely),
and $t \in \mathcal{G}$, denoting by $g_n$ the empirical risk minimizer, we have $R^* = 0$ and
also $R_n(g_n) = 0$. In particular, when $\mathcal{G}$ is a class of VC dimension $h$, we obtain

$$R(g_n) = O\left(\frac{h \log n}{n}\right) .$$

So, in a way, Theorem 7 allows us to interpolate between the best case where
the rate of convergence is $O(h\log n/n)$ and the worst case where the rate is
$O(\sqrt{h\log n/n})$ (it does not allow us to remove the $\log n$ factor in this case).

It is also possible to derive from Theorem 7 relative error bounds for the
minimizer of the empirical error. With probability at least $1-\delta$,

$$R(g_n) \le R(g^*) + 2\sqrt{R(g^*)\,\frac{\log S_{\mathcal{G}}(2n) + \log\frac{4}{\delta}}{n}} + 4\,\frac{\log S_{\mathcal{G}}(2n) + \log\frac{4}{\delta}}{n} .$$

We notice here that when $R(g^*) = 0$ (i.e. $t \in \mathcal{G}$ and $R^* = 0$), the rate is again
of order $1/n$ while, as soon as $R(g^*) > 0$, the rate is of order $1/\sqrt{n}$. Therefore,
it is not possible to obtain a rate with a power of $n$ in between $-1/2$ and $-1$.

The main reason is that the factor of the square root term $R(g^*)$ is not the
right quantity to use here since it does not vary with $n$. We will see later that
one can have instead $R(g_n) - R(g^*)$ as a factor, which usually converges to
zero as $n$ increases. Unfortunately, Theorem 7 cannot be applied to functions
of the type $f - f^*$ (which would be needed to have the mentioned factor), so we
will need a refined approach.



6.3 Noise Conditions 

The refinement we seek to obtain requires certain specific assumptions about the
noise function $s(x)$. The ideal case is when $s(x) = 0$ everywhere (which
corresponds to $R^* = 0$ and $Y = t(X)$). We now introduce quantities that measure
how well-behaved the noise function is.

The situation is favorable when the regression function $\eta(x)$ is not too close
to 0, or at least not too often close to 0. Indeed, $\eta(x) = 0$ means that the noise
is maximum at $x$ ($s(x) = 1/2$) and that the label is completely undetermined
(any prediction would yield an error with probability 1/2).

Definitions. There are two types of conditions. 

Definition 6 (Massart's Noise Condition). For some $c > 0$, assume

$$|\eta(X)| > \frac{1}{c} \quad \text{almost surely.}$$

This condition implies that there is no region where the decision is completely
random, i.e. the noise is bounded away from 1/2.

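Definition 7 (Tsybakov's Noise Condition). Let $\alpha \in [0, 1]$, and assume that one of the following equivalent conditions is satisfied:

(i) $\exists c > 0,\ \forall g \in \{-1,1\}^{\mathcal{X}}$, $\mathbb{P}[g(X)\eta(X) \le 0] \le c\,(R(g) - R^*)^\alpha$

(ii) $\exists c > 0,\ \forall A \subset \mathcal{X}$, $\int_A dP(x) \le c \left(\int_A |\eta(x)|\, dP(x)\right)^\alpha$

(iii) $\exists B > 0,\ \forall t \ge 0$, $\mathbb{P}[|\eta(X)| \le t] \le B\, t^{\frac{\alpha}{1-\alpha}}$

Condition (iii) is probably the easiest to interpret: it means that $\eta(x)$ is close
to the critical value 0 with low probability.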

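We indicate how to prove that conditions (i), (ii) and (iii) are indeed equivalent:

(i) $\Leftrightarrow$ (ii) It is easy to check that $R(g) - R^* = \mathbb{E}\left[|\eta(X)|\,\mathbb{1}_{g\eta\le 0}\right]$. For each function
$g$, there exists a set $A$ such that $\mathbb{1}_A = \mathbb{1}_{g\eta\le 0}$.

(ii) $\Rightarrow$ (iii) Let $A = \{x : |\eta(x)| \le t\}$. Then

$$\begin{aligned}
\mathbb{P}[|\eta| \le t] = \int_A dP(x) &\le c\left(\int_A |\eta(x)|\,dP(x)\right)^\alpha \\
&\le c\,t^\alpha \left(\int_A dP(x)\right)^\alpha \\
\Rightarrow\quad \mathbb{P}[|\eta| \le t] &\le c^{\frac{1}{1-\alpha}}\, t^{\frac{\alpha}{1-\alpha}} .
\end{aligned}$$

(iii) $\Rightarrow$ (i) We write

$$\begin{aligned}
R(g) - R^* &= \mathbb{E}\left[|\eta(X)|\,\mathbb{1}_{g\eta\le 0}\right] \\
&\ge t\,\mathbb{E}\left[\mathbb{1}_{g\eta\le 0}\mathbb{1}_{|\eta|>t}\right] \\
&= t\,\mathbb{P}[|\eta| > t] - t\,\mathbb{E}\left[\mathbb{1}_{g\eta>0}\mathbb{1}_{|\eta|>t}\right] \\
&\ge t\left(1 - B t^{\frac{\alpha}{1-\alpha}}\right) - t\,\mathbb{P}[g\eta > 0] = t\left(\mathbb{P}[g\eta \le 0] - B t^{\frac{\alpha}{1-\alpha}}\right) .
\end{aligned}$$

Taking $t = \left(\frac{(1-\alpha)\,\mathbb{P}[g\eta\le 0]}{B}\right)^{(1-\alpha)/\alpha}$ finally gives

$$\mathbb{P}[g\eta \le 0] \le \frac{B^{1-\alpha}}{(1-\alpha)^{1-\alpha}\alpha^\alpha}\,(R(g) - R^*)^\alpha .$$

We notice that the parameter $\alpha$ has to be in $[0,1]$. Indeed, one has the
opposite inequality

$$R(g) - R^* = \mathbb{E}\left[|\eta(X)|\,\mathbb{1}_{g\eta\le 0}\right] \le \mathbb{E}\left[\mathbb{1}_{g\eta\le 0}\right] = \mathbb{P}[g(X)\eta(X) \le 0] ,$$

which is incompatible with condition (i) if $\alpha > 1$.

We also notice that when $\alpha = 0$, Tsybakov's condition is void, and when
$\alpha = 1$, it is equivalent to Massart's condition.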



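Consequences. The conditions we impose on the noise yield a crucial
relationship between the variance and the expectation of functions in the so-called
relative loss class defined as

$$\tilde{\mathcal{F}} = \left\{(x,y) \mapsto f(x,y) - \mathbb{1}_{t(x)\ne y} : f \in \mathcal{F}\right\} .$$

This relationship will allow us to exploit Bernstein type inequalities applied to
this latter class.

Under Massart's condition, one has (written in terms of the initial class) for
$g \in \mathcal{G}$,

$$\mathbb{E}\left[\left(\mathbb{1}_{g(X)\ne Y} - \mathbb{1}_{t(X)\ne Y}\right)^2\right] \le c\,(R(g) - R^*) ,$$

or, equivalently, for $f \in \tilde{\mathcal{F}}$, $\operatorname{Var} f \le Pf^2 \le c\,Pf$. Under Tsybakov's condition
this becomes for $g \in \mathcal{G}$,

$$\mathbb{E}\left[\left(\mathbb{1}_{g(X)\ne Y} - \mathbb{1}_{t(X)\ne Y}\right)^2\right] \le c\,(R(g) - R^*)^\alpha ,$$

and for $f \in \tilde{\mathcal{F}}$, $\operatorname{Var} f \le Pf^2 \le c\,(Pf)^\alpha$.

In the finite case, with $|\mathcal{G}| = N$, one can easily apply Bernstein's inequality
to $\tilde{\mathcal{F}}$ and the finite union bound to get that with probability at least $1-\delta$, for
all $g \in \mathcal{G}$,

$$R(g) - R^* \le R_n(g) - R_n(t) + \sqrt{\frac{8c\,(R(g) - R^*)^\alpha \log\frac{N}{\delta}}{n}} + \frac{4\log\frac{N}{\delta}}{3n} .$$

As a consequence, when $t \in \mathcal{G}$, and $g_n$ is the minimizer of the empirical error
(hence $R_n(g_n) \le R_n(t)$), one has

$$R(g_n) - R^* \le C\left(\frac{\log\frac{N}{\delta}}{n}\right)^{\frac{1}{2-\alpha}} ,$$

which is always better than $n^{-1/2}$ for $\alpha > 0$ and is valid even if $R^* > 0$.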
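The passage from the Bernstein-type inequality to the rate with exponent $1/(2-\alpha)$ amounts to solving an inequality in the excess risk. A sketch of this step follows; the shorthands $x$ and $L$ are introduced here, and the derivation assumes $L \le n$ so that the linear term is dominated.

```latex
% Shorthand introduced here: x = R(g_n) - R^*, L = log(N/delta).
% Since R_n(g_n) - R_n(t) <= 0, the inequality for g = g_n reads
x \le \sqrt{\frac{8 c\, x^{\alpha} L}{n}} + \frac{4L}{3n} .

% At least one of the two terms on the right is >= x/2.
% Case 1:
\frac{x}{2} \le \sqrt{\frac{8 c\, x^{\alpha} L}{n}}
  \;\Longrightarrow\; x^{2-\alpha} \le \frac{32\, c\, L}{n}
  \;\Longrightarrow\; x \le \left(\frac{32\, c\, L}{n}\right)^{\frac{1}{2-\alpha}} .

% Case 2 (using L/n <= 1 and 1/(2 - alpha) <= 1 for alpha in [0, 1]):
\frac{x}{2} \le \frac{4L}{3n}
  \;\Longrightarrow\; x \le \frac{8L}{3n}
  \le \frac{8}{3}\left(\frac{L}{n}\right)^{\frac{1}{2-\alpha}} .

% In both cases
x \le C \left(\frac{L}{n}\right)^{\frac{1}{2-\alpha}},
  \qquad C = \max\left\{(32 c)^{\frac{1}{2-\alpha}},\ \tfrac{8}{3}\right\} .
```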



6.4 Local Rademacher Averages 

In this section we generalize the above result by introducing a localized version 
of the Rademacher averages. Going from the finite to the general case is more in- 
volved than what has been seen before. We first give the appropriate definitions, 
then state the result and give a proof sketch. 

Definitions. Local Rademacher averages refer to Rademacher averages of
subsets of the function class determined by a condition on the variance of the
function.

Definition 8 (Local Rademacher Average). The local Rademacher average at radius $r \ge 0$ for the class $\mathcal{F}$ is defined as

$$\mathcal{R}(\mathcal{F}, r) = \mathbb{E} \sup_{f\in\mathcal{F} : Pf^2 \le r} R_n f .$$

The reason for this definition is that, as we have seen before, the crucial
ingredient to obtain better rates of convergence is to use the variance of the
functions. Localizing the Rademacher average allows us to focus on the part of the
function class where the fast rate phenomenon occurs, namely the functions with
small variance.

Next we introduce the concept of a sub-root function, a real-valued function
with certain monotonicity properties.

Definition 9 (Sub-root Function). A function $\psi : \mathbb{R} \to \mathbb{R}$ is sub-root if

(i) $\psi$ is non-decreasing,

(ii) $\psi$ is non-negative,

(iii) $\psi(r)/\sqrt{r}$ is non-increasing.

An immediate consequence of this definition is the following result.

Lemma 5. A sub-root function

(i) is continuous,

(ii) has a unique (non-zero) fixed point $r^*$ satisfying $\psi(r^*) = r^*$.

Figure 6 shows a typical sub-root function and its fixed point. 




Fig. 6. An example of a sub-root function and its fixed point 
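The fixed point of a sub-root function is easy to approximate numerically: starting from a positive value, the iteration $r \leftarrow \psi(r)$ moves monotonically toward it. The sketch below is an illustration under assumptions made here, with two hypothetical sub-root functions, $\psi(r) = \sqrt{0.04\,r}$ (fixed point $0.04$) and $\psi(r) = 0.2\sqrt{r} + 0.01$; the helper name and tolerances are arbitrary choices.

```python
import math


def fixed_point(psi, r0=1.0, tol=1e-12, max_iter=10_000):
    """Iterate r <- psi(r) from a positive start r0 until successive values agree.

    Starting at r0 > 0 avoids the trivial fixed point at 0 that some
    sub-root functions (such as a pure square root) also have.
    """
    r = r0
    for _ in range(max_iter):
        r_next = psi(r)
        if abs(r_next - r) <= tol:
            return r_next
        r = r_next
    return r


def sqrt_type(r):
    """Hypothetical sub-root function with fixed point 0.04."""
    return math.sqrt(0.04 * r)


def affine_sqrt(r):
    """Hypothetical sub-root function a * sqrt(r) + b with a = 0.2, b = 0.01."""
    return 0.2 * math.sqrt(r) + 0.01


r_star_1 = fixed_point(sqrt_type)
r_star_2 = fixed_point(affine_sqrt)
print(f"r* = {r_star_1:.6f} and r* = {r_star_2:.6f}")
```

For a function of the form $\psi(r) = \sqrt{c\,r}$ the fixed point is simply $c$, which is why bounding a local Rademacher average by such a function immediately gives a value for $r^*$.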

Before seeing the rationale for introducing the sub-root concept, we need yet 
another definition, that of a ‘star-hull’ (somewhat similar to a convex hull). 

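Definition 10 (Star-Hull). Let $\mathcal{F}$ be a set of functions. Its star-hull is defined as

$$\star\mathcal{F} = \{\alpha f : f \in \mathcal{F},\ \alpha \in [0,1]\} .$$

Now, we state a lemma that indicates that by taking the star-hull of a class
of functions, we are guaranteed that the local Rademacher average behaves like
a sub-root function, and thus has a unique fixed point. This fixed point will turn
out to be the key quantity in the relative error bounds.

Lemma 6. For any class of functions $\mathcal{F}$, the function

$$r \mapsto \mathcal{R}(\star\mathcal{F}, r)$$

is sub-root.

One legitimate question is whether taking the star-hull does not enlarge the
class too much. One way to see what the effect is on the size of the class is to
compare the metric entropy (log covering numbers) of $\mathcal{F}$ and of $\star\mathcal{F}$. It is possible
to see that the entropy increases only by a logarithmic factor, which is essentially
negligible.

Result. We now state the main result involving local Rademacher averages and
their fixed point.

Theorem 8. Let $\mathcal{F}$ be a class of bounded functions (e.g. $f \in [-1,1]$) and $r^*$ be the fixed point of $\mathcal{R}(\star\mathcal{F}, r)$. There exists a constant $C > 0$ such that, with probability at least $1-\delta$,

$$\forall f\in\mathcal{F},\quad Pf - P_n f \le C\left(\sqrt{r^* \operatorname{Var} f} + \frac{\log\frac{1}{\delta} + \log\log n}{n}\right) .$$

If in addition the functions in $\mathcal{F}$ satisfy $\operatorname{Var} f \le c\,(Pf)^\alpha$, then one obtains that, with probability at least $1-\delta$,

$$\forall f\in\mathcal{F},\quad Pf \le C\left(P_n f + (r^*)^{\frac{1}{2-\alpha}} + \frac{\log\frac{1}{\delta} + \log\log n}{n}\right) .$$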

Proof. We only give the main steps of the proof.

1. The starting point is Talagrand's inequality for empirical processes, a
generalization of McDiarmid's inequality of Bernstein type (i.e. which includes
the variance). This inequality tells us that, with high probability,

$$\sup_{f\in\mathcal{F}} (Pf - P_n f) \le \mathbb{E}\left[\sup_{f\in\mathcal{F}} (Pf - P_n f)\right] + c\sqrt{\sup_{f\in\mathcal{F}} \operatorname{Var} f / n} + d/n ,$$

for some constants $c$, $d$.

2. The second step consists in 'peeling' the class, that is splitting the class into
subclasses according to the variance of the functions:

$$\mathcal{F}_k = \left\{f : \operatorname{Var} f \in [x^k, x^{k+1})\right\} .$$

3. We can then apply Talagrand's inequality to each of the sub-classes
separately to get, with high probability,

$$\sup_{f\in\mathcal{F}_k} (Pf - P_n f) \le \mathbb{E}\left[\sup_{f\in\mathcal{F}_k} (Pf - P_n f)\right] + c\sqrt{x \operatorname{Var} f / n} + d/n .$$

4. Then the symmetrization lemma allows us to introduce local Rademacher
averages. We get that, with high probability,

$$\forall f\in\mathcal{F},\quad Pf - P_n f \le 2\mathcal{R}(\mathcal{F}, x\operatorname{Var} f) + c\sqrt{x\operatorname{Var} f/n} + d/n .$$

5. We then have to 'solve' this inequality. Things are simple if $\mathcal{R}$ behaves like a
square root function since we can upper bound the local Rademacher average
by the value of its fixed point. With high probability,

$$Pf - P_n f \le 2\sqrt{r^* \operatorname{Var} f} + c\sqrt{x\operatorname{Var} f/n} + d/n .$$

6. Finally, we use the relationship between variance and expectation

$$\operatorname{Var} f \le c\,(Pf)^\alpha ,$$

and solve the inequality in $Pf$ to get the result. □

We will not go into the details of how to apply the above result, but we give
some remarks about its use.

An important example is the case where the class $\mathcal{F}$ is of finite VC dimension
$h$. In that case, one has

$$\mathcal{R}(\mathcal{F}, r) \le C\sqrt{\frac{r h \log n}{n}} ,$$

so that $r^* \le C\,\frac{h\log n}{n}$. As a consequence, we obtain, under Tsybakov's condition, a
rate of convergence of $Pf_n$ to $Pf^*$ of order $O(1/n^{1/(2-\alpha)})$. It is important to note that
in this case, the rate of convergence of $P_n f$ to $Pf$ is $O(1/\sqrt{n})$. So we obtain
a fast rate by looking at the relative error. These fast rates can be obtained
provided $t \in \mathcal{G}$ (but it is not needed that $R^* = 0$). This requirement can be
removed if one uses structural risk minimization or regularization.

Another related result is that, as in the global case, one can obtain a bound
with data-dependent (i.e. conditional) local Rademacher averages

$$\mathcal{R}_n(\mathcal{F}, r) = \mathbb{E}_\sigma \sup_{f\in\mathcal{F} : Pf^2 \le r} R_n f .$$

The result is the same as before (with different constants) under the same
conditions as in Theorem 8. With probability at least $1-\delta$,

$$Pf \le C\left(P_n f + (r^*)^{\frac{1}{2-\alpha}} + \frac{\log\frac{1}{\delta} + \log\log n}{n}\right) ,$$

where $r^*$ is the fixed point of a sub-root upper bound of $\mathcal{R}_n(\mathcal{F}, r)$.

Hence, we can get improved rates when the noise is well-behaved and these
rates interpolate between $n^{-1/2}$ and $n^{-1}$. However, it is not in general possible
to estimate the parameters ($c$ and $\alpha$) entering in the noise conditions, but we will
not discuss this issue further here. Another point is that although the capacity
measure that we use seems 'local', it does depend on all the functions in the
class, but each of them is implicitly appropriately rescaled. Indeed, in $\mathcal{R}(\star\mathcal{F}, r)$,
each function $f \in \mathcal{F}$ with $Pf^2 \ge r$ is considered at scale $r/Pf^2$.



Bibliographical Remarks. Hoeffding's inequality appears in [19]. For a proof
of the contraction principle we refer to Ledoux and Talagrand [20].

Vapnik-Chervonenkis-Sauer-Shelah's lemma was proved independently by
Sauer [21], Shelah [22], and Vapnik and Chervonenkis [18]. For related
combinatorial results we refer to Alesker [23], Alon, Ben-David, Cesa-Bianchi, and
Haussler [24], Cesa-Bianchi and Haussler [25], Frankl [26], Haussler [27], Szarek
and Talagrand [28].

The study of uniform deviations of averages from their expectations is one of the
central problems of empirical process theory. Here we merely refer to some of
the comprehensive coverages, such as Dudley [29], Giné [30], Vapnik [1], van
der Vaart and Wellner [31]. The use of empirical processes in classification
was pioneered by Vapnik and Chervonenkis [18, 15] and re-discovered 20 years
later by Blumer, Ehrenfeucht, Haussler, and Warmuth [32], Ehrenfeucht,
Haussler, Kearns, and Valiant [33]. For surveys see Anthony and Bartlett [2],
Devroye, Györfi, and Lugosi [4], Kearns and Vazirani [7], Natarajan [12], Vapnik
[14, 1].

The question of how $\sup_{f\in\mathcal{F}}(P(f) - P_n(f))$ behaves has been known as the
Glivenko-Cantelli problem and much has been said about it. A few key references
include Alon, Ben-David, Cesa-Bianchi, and Haussler [24], Dudley [34, 35, 36],
Talagrand [37, 38], Vapnik and Chervonenkis [18, 39].

The VC dimension has been widely studied and many of its properties are
known. We refer to Anthony and Bartlett [2], Assouad [40], Cover [41], Dudley
[42, 29], Goldberg and Jerrum [43], Karpinski and A. Macintyre [44], Khovanskii
[45], Koiran and Sontag [46], Macintyre and Sontag [47], Steele [48], and Wenocur
and Dudley [49].

The bounded differences inequality was formulated explicitly first by
McDiarmid [17] who proved it by martingale methods (see the surveys [17], [50]),
but closely related concentration results have been obtained in various ways
including information-theoretic methods (see Ahlswede, Gács, and Körner [51],
Marton [52], [53], [54], Dembo [55], Massart [56] and Rio [57]), Talagrand's
induction method [58], [59], [60] (see also Luczak and McDiarmid [61], McDiarmid
[62], Panchenko [63, 64, 65]) and the so-called "entropy method", based on
logarithmic Sobolev inequalities, developed by Ledoux [66], [67], see also Bobkov and
Ledoux [68], Massart [69], Rio [57], Boucheron, Lugosi, and Massart [70], [71],
Boucheron, Bousquet, Lugosi, and Massart [72], and Bousquet [73].

Symmetrization lemmas can be found in Giné and Zinn [74] and Vapnik and
Chervonenkis [18, 15].

The use of Rademacher averages in classification was first promoted by
Koltchinskii [75] and Bartlett, Boucheron, and Lugosi [76], see also Koltchinskii
and Panchenko [77, 78], Bartlett and Mendelson [79], Bartlett, Bousquet,
and Mendelson [80], Bousquet, Koltchinskii, and Panchenko [81], Kégl, Linder,
and Lugosi [82].



A Probability Tools 

This section recalls some basic facts from probability theory that are used
throughout this tutorial (sometimes without explicitly mentioning it).

We denote by $A$ and $B$ some events (i.e. elements of a $\sigma$-algebra), and by $X$
some real-valued random variable.

A.1 Basic Facts

- Union:

$$\mathbb{P}[A \text{ or } B] \le \mathbb{P}[A] + \mathbb{P}[B] .$$

- Inclusion: If $A \Rightarrow B$, then $\mathbb{P}[A] \le \mathbb{P}[B]$.

- Inversion: If $\mathbb{P}[X \ge t] \le F(t)$ then with probability at least $1-\delta$,

$$X \le F^{-1}(\delta) .$$

- Expectation: If $X \ge 0$,

$$\mathbb{E}[X] = \int_0^\infty \mathbb{P}[X \ge t]\, dt .$$

A.2 Basic Inequalities

All the inequalities below are valid as soon as the right-hand side exists.

- Jensen: for $f$ convex,

$$f(\mathbb{E}[X]) \le \mathbb{E}[f(X)] .$$

- Markov: If $X \ge 0$ then for all $t > 0$,

$$\mathbb{P}[X \ge t] \le \frac{\mathbb{E}[X]}{t} .$$

- Chebyshev: for $t > 0$,

$$\mathbb{P}\left[|X - \mathbb{E}[X]| \ge t\right] \le \frac{\operatorname{Var} X}{t^2} .$$

- Chernoff: for all $t \in \mathbb{R}$,

$$\mathbb{P}[X \ge t] \le \inf_{\lambda \ge 0} \mathbb{E}\left[e^{\lambda(X-t)}\right] .$$



B No Free Lunch 



We can now give a formal definition of consistency and state the core results 
about the impossibility of universally good algorithms. 



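Definition 11 (Consistency). An algorithm is consistent if for any probability measure $P$,

$$\lim_{n\to\infty} R(g_n) = R^* \quad \text{almost surely.}$$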



It is important to understand the reasons that make possible the existence of
consistent algorithms. In the case where the input space $\mathcal{X}$ is countable, things
are somehow easy since even if there is no relationship at all between inputs and
outputs, by repeatedly sampling data independently from $P$, one will get to see
an increasing number of different inputs which will eventually converge to all
the inputs. So, in the countable case, an algorithm which would simply learn 'by
heart' (i.e. makes a majority vote when the instance has been seen before, and
produces an arbitrary prediction otherwise) would be consistent.

In the case where $\mathcal{X}$ is not countable (e.g. $\mathcal{X} = \mathbb{R}$), things are more
subtle. Indeed, in that case, there is a seemingly innocent assumption that
becomes crucial: to be able to define a probability measure $P$ on $\mathcal{X}$, one needs
a $\sigma$-algebra on that space, which is typically the Borel $\sigma$-algebra. So the
hidden assumption is that $P$ is a Borel measure. This means that the topology
of $\mathbb{R}$ plays a role here, and thus, the target function $t$ will be Borel
measurable. In a sense this guarantees that it is possible to approximate $t$ from its
value (or approximate value) at a finite number of points. The algorithms that
will achieve consistency are thus those that use the topology in the sense of
'generalizing' the observed values to neighborhoods (e.g. local classifiers). In a
way, the measurability of $t$ is one of the crudest notions of smoothness of
functions.

We now cite two important results. The first one tells that for a fixed sample 
size, one can construct arbitrarily bad problems for a given algorithm. 

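Theorem 9 (No Free Lunch, see e.g. [4]). For any algorithm, any $n$ and any $\varepsilon > 0$, there exists a distribution $P$ such that $R^* = 0$ and

$$\mathbb{P}\left[R(g_n) \ge \frac{1}{2} - \varepsilon\right] = 1 .$$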



The second result is more subtle and indicates that given an algorithm, one 
can construct a problem for which this algorithm will converge as slowly as one 
wishes. 



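Theorem 10 (No Free Lunch at All, see e.g. [4]). For any algorithm, and any sequence $(a_n)$ that converges to 0, there exists a probability distribution $P$ such that $R^* = 0$ and

$$R(g_n) \ge a_n .$$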



In the above theorem, the 'bad' probability measure is constructed on a
countable set (where the outputs are not related at all to the inputs so that no
generalization is possible), and is such that the rate at which one gets to see new
inputs is as slow as the convergence of $a_n$.

Finally we mention other notions of consistency. 

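Definition 12 (VC Consistency of ERM). The ERM algorithm is consistent if for any probability measure $P$,

$$R(g_n) \to R(g^*) \quad \text{in probability,}$$

and

$$R_n(g_n) \to R(g^*) \quad \text{in probability.}$$

Definition 13 (VC Non-trivial Consistency of ERM). The ERM algorithm is non-trivially consistent for the set $\mathcal{G}$ and the probability distribution $P$ if for any $c \in \mathbb{R}$,

$$\inf_{f\in\mathcal{F} : Pf > c} P_n(f) \to \inf_{f\in\mathcal{F} : Pf > c} P(f) \quad \text{in probability.}$$



Abstract. Concentration inequalities deal with deviations of functions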
of independent random variables from their expectation. In the last 
decade new tools have been introduced making it possible to establish 
simple and powerful inequalities. These inequalities are at the heart of 
the mathematical analysis of various problems in machine learning and 
made it possible to derive new efficient algorithms. This text attempts 
to summarize some of the basic tools. 



1 Introduction 

The laws of large numbers of classical probability theory state that sums of 
independent random variables are, under very mild conditions, close to their 
expectation with a large probability. Such sums are the most basic examples 
of random variables concentrated around their mean. More recent results reveal 
that such a behavior is shared by a large class of general functions of independent 
random variables. The purpose of these notes is to give an introduction to some 
of these general concentration inequalities. 

The inequalities discussed in these notes bound tail probabilities of general
functions of independent random variables. Several methods have been known to
prove such inequalities, including martingale methods (see Milman and
Schechtman [1] and the surveys of McDiarmid [2, 3]), information-theoretic methods (see
Ahlswede, Gács, and Körner [4], Marton [5, 6, 7], Dembo [8], Massart [9] and Rio
[10]), Talagrand's induction method [11, 12, 13] (see also Luczak and McDiarmid
[14], McDiarmid [15] and Panchenko [16, 17, 18]), the decoupling method
surveyed by de la Peña and Giné [19], and the so-called "entropy method", based on
logarithmic Sobolev inequalities, developed by Ledoux [20, 21], see also Bobkov
and Ledoux [22], Massart [23], Rio [10], Klein [24], Boucheron, Lugosi, and
Massart [25, 26], Bousquet [27, 28], and Boucheron, Bousquet, Lugosi, and Massart
[29]. Also, various problem-specific methods have been worked out in random
graph theory, see Janson, Łuczak, and Ruciński [30] for a survey.

First of all we recall some of the essential basic tools needed in the rest of
these notes. For any nonnegative random variable $X$,

$$\mathbb{E}X = \int_0^\infty \mathbb{P}\{X \ge t\}\, dt .$$

This implies Markov's inequality: for any nonnegative random variable $X$,
and $t > 0$,

$$\mathbb{P}\{X \ge t\} \le \frac{\mathbb{E}X}{t} .$$

It follows from Markov's inequality that if $\phi$ is a strictly monotonically
increasing nonnegative-valued function then for any random variable $X$ and real
number $t$,

$$\mathbb{P}\{X \ge t\} = \mathbb{P}\{\phi(X) \ge \phi(t)\} \le \frac{\mathbb{E}\phi(X)}{\phi(t)} .$$

An application of this with $\phi(x) = x^2$ is Chebyshev's inequality: if $X$ is an
arbitrary random variable and $t > 0$, then

$$\mathbb{P}\{|X - \mathbb{E}X| \ge t\} = \mathbb{P}\left\{|X - \mathbb{E}X|^2 \ge t^2\right\} \le \frac{\mathbb{E}\left[|X - \mathbb{E}X|^2\right]}{t^2} = \frac{\operatorname{Var}\{X\}}{t^2} .$$

More generally taking $\phi(x) = x^q$ ($x \ge 0$), for any $q > 0$ we have

$$\mathbb{P}\{|X - \mathbb{E}X| \ge t\} \le \frac{\mathbb{E}\left[|X - \mathbb{E}X|^q\right]}{t^q} .$$

In specific examples one may choose the value of $q$ to optimize the obtained
upper bound. Such moment bounds often provide very sharp estimates
of the tail probabilities. A related idea is at the basis of Chernoff's bounding
method. Taking $\phi(x) = e^{sx}$ where $s$ is an arbitrary positive number, for any
random variable $X$, and any $t > 0$, we have

$$\mathbb{P}\{X \ge t\} = \mathbb{P}\{e^{sX} \ge e^{st}\} \le \frac{\mathbb{E}e^{sX}}{e^{st}} .$$

In Chernoff's method, we find an $s > 0$ that minimizes the upper bound or
makes the upper bound small.
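A small numerical comparison makes the difference between these bounding methods concrete. The sketch below considers a hypothetical example chosen here, the number of heads $S$ in $n = 100$ fair coin flips, and compares the exact tail $\mathbb{P}\{S \ge 70\}$ with Chebyshev's bound and with Chernoff's bound, where the free parameter $s$ is tuned by a simple grid search rather than analytically.

```python
import math


def exact_upper_tail(n, t):
    """P{S >= t} for S ~ Binomial(n, 1/2), computed with exact integers."""
    return sum(math.comb(n, k) for k in range(math.ceil(t), n + 1)) / 2 ** n


def chebyshev_bound(n, t):
    """Var{S} / (t - ES)^2, which bounds P{|S - ES| >= t - ES} and hence P{S >= t}."""
    return (n / 4) / (t - n / 2) ** 2


def chernoff_bound(n, t, grid=None):
    """min over s > 0 of e^{-st} E e^{sS}, using E e^{sS} = ((1 + e^s) / 2)^n."""
    grid = grid or [0.01 * j for j in range(1, 301)]
    return min(math.exp(-s * t) * ((1 + math.exp(s)) / 2) ** n for s in grid)


n, t = 100, 70
exact = exact_upper_tail(n, t)
cheb = chebyshev_bound(n, t)
chern = chernoff_bound(n, t)
print(f"exact = {exact:.2e}, Chernoff = {chern:.2e}, Chebyshev = {cheb:.2e}")
```

The grid search only approximates the minimizing $s$, so the printed Chernoff value is a valid bound that may be slightly larger than the optimum.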

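Next we recall some simple inequalities for sums of independent random
variables. Here we are primarily concerned with upper bounds for the probabilities
of deviations from the mean, that is, to obtain inequalities for $\mathbb{P}\{S_n - \mathbb{E}S_n \ge t\}$,
with $S_n = \sum_{i=1}^n X_i$, where $X_1, \ldots, X_n$ are independent real-valued random
variables.

Chebyshev's inequality and independence immediately imply

$$\mathbb{P}\{|S_n - \mathbb{E}S_n| \ge t\} \le \frac{\operatorname{Var}\{S_n\}}{t^2} = \frac{\sum_{i=1}^n \operatorname{Var}\{X_i\}}{t^2} .$$

In other words, writing $\sigma^2 = \frac{1}{n}\sum_{i=1}^n \operatorname{Var}\{X_i\}$,

$$\mathbb{P}\left\{\left|\frac{1}{n}\sum_{i=1}^n (X_i - \mathbb{E}X_i)\right| \ge \varepsilon\right\} \le \frac{\sigma^2}{n\varepsilon^2} .$$

Chernoff's bounding method is especially convenient for bounding tail
probabilities of sums of independent random variables. The reason is that since the
expected value of a product of independent random variables equals the product
of the expected values, Chernoff's bound becomes

$$\begin{aligned}
\mathbb{P}\{S_n - \mathbb{E}S_n \ge t\} &\le e^{-st}\,\mathbb{E}\left[\exp\left(s\sum_{i=1}^n (X_i - \mathbb{E}X_i)\right)\right] \\
&= e^{-st}\prod_{i=1}^n \mathbb{E}\left[e^{s(X_i - \mathbb{E}X_i)}\right] \quad \text{(by independence).} \qquad (1)
\end{aligned}$$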



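Now the problem of finding tight bounds comes down to finding a good upper
bound for the moment generating function of the random variables $X_i - \mathbb{E}X_i$.
There are many ways of doing this. For bounded random variables perhaps the
most elegant version is due to Hoeffding [31], which we state without proof.

Lemma 1 (Hoeffding's inequality). Let $X$ be a random variable with $\mathbb{E}X = 0$, $a \le X \le b$. Then for $s > 0$,

$$\mathbb{E}\left[e^{sX}\right] \le e^{s^2(b-a)^2/8} .$$

This lemma, combined with (1), immediately implies Hoeffding's tail
inequality [31]:

Theorem 1. Let $X_1, \ldots, X_n$ be independent bounded random variables such that $X_i$ falls in the interval $[a_i, b_i]$ with probability one. Then for any $t > 0$ we have

$$\mathbb{P}\{S_n - \mathbb{E}S_n \ge t\} \le e^{-2t^2/\sum_{i=1}^n (b_i - a_i)^2}$$

and

$$\mathbb{P}\{S_n - \mathbb{E}S_n \le -t\} \le e^{-2t^2/\sum_{i=1}^n (b_i - a_i)^2} .$$

The theorem above is generally known as Hoeffding's inequality. For binomial
random variables it was proved by Chernoff [32] and Okamoto [33].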

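A disadvantage of Hoeffding's inequality is that it ignores information about
the variance of the $X_i$'s. The inequalities discussed next provide an improvement
in this respect.

Assume now without loss of generality that $\mathbb{E}X_i = 0$ for all $i = 1, \ldots, n$. Our
starting point is again (1), that is, we need bounds for $\mathbb{E}\left[e^{sX_i}\right]$. Introduce the
notation $\sigma_i^2 = \mathbb{E}[X_i^2]$, and

$$F_i = \mathbb{E}[\psi(sX_i)] = \sum_{r=2}^\infty \frac{s^r\,\mathbb{E}[X_i^r]}{r!} .$$

Here $\psi(x) = \exp(x) - x - 1$; observe that $\psi(x) \le x^2/2$ for $x \le 0$ and
$\psi(sx) \le x^2\psi(s)$ for $s \ge 0$ and $x \in [0, 1]$. Since $e^{sx} = 1 + sx + \psi(sx)$, we may
write

$$\begin{aligned}
\mathbb{E}\left[e^{sX_i}\right] &= 1 + s\mathbb{E}[X_i] + \mathbb{E}[\psi(sX_i)] \\
&= 1 + \mathbb{E}[\psi(sX_i)] && \text{(since } \mathbb{E}[X_i] = 0\text{)} \\
&\le 1 + \mathbb{E}\left[\psi(s(X_i)_+) + \psi(-s(X_i)_-)\right] && \text{(where } x_+ = \max(0, x) \text{ and } x_- = \max(0, -x)\text{)} \\
&\le 1 + \mathbb{E}\left[\psi(s(X_i)_+) + \frac{s^2}{2}(X_i)_-^2\right] && \text{(using } \psi(x) \le x^2/2 \text{ for } x \le 0\text{).}
\end{aligned}$$

Now assume that the $X_i$'s are bounded such that $X_i \le 1$. Thus, we have
obtained

$$\mathbb{E}\left[e^{sX_i}\right] \le 1 + \mathbb{E}\left[\psi(s)(X_i)_+^2 + \frac{s^2}{2}(X_i)_-^2\right] \le 1 + \psi(s)\mathbb{E}[X_i^2] \le \exp\left(\psi(s)\mathbb{E}[X_i^2]\right) .$$

Returning to (1) and using the notation $\sigma^2 = (1/n)\sum_{i=1}^n \sigma_i^2$, we get

$$\mathbb{P}\left\{\sum_{i=1}^n X_i > t\right\} \le e^{n\sigma^2\psi(s) - st} .$$

Now we are free to choose $s$. The upper bound is minimized for

$$s = \log\left(1 + \frac{t}{n\sigma^2}\right) .$$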

Resubstituting this value, we obtain Bennett’s inequality [34]: 



Theorem 2. BENNETT’S INEQUALITY. Ai, ...,A„ 

A, < 1 



= -^Var{A,}. 



i=l 



t > 0, 



^A, >t [ <exp(-na2/r( — 



, i=l 



h{u) = (1 + m) log(l + u) — u u>0 



The message of this inequality is perhaps best seen if we do some further
bounding. Applying the elementary inequality $h(u) \ge u^2/(2 + 2u/3)$, $u \ge 0$
(which may be seen by comparing the derivatives of both sides), we obtain a
classical inequality of Bernstein [35]:

Theorem 3. Bernstein’s inequality. Under the conditions of the previous theorem,
for any $\epsilon > 0$,

$$P\left\{\frac{1}{n}\sum_{i=1}^n X_i > \epsilon\right\} \le \exp\left(-\frac{n\epsilon^2}{2(\sigma^2 + \epsilon/3)}\right) .$$

Bernstein’s inequality points out an interesting phenomenon: if $\sigma^2 < \epsilon$,
then the upper bound behaves like $e^{-n\epsilon}$ instead of the $e^{-n\epsilon^2}$ guaranteed by
Hoeffding’s inequality. This might be intuitively explained by recalling that a
Binomial($n, \lambda/n$) distribution can be approximated, for large $n$, by a Poisson($\lambda$)
distribution, whose tail decreases as $e^{-\lambda}$.
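The small-variance phenomenon can be seen numerically. The sketch below evaluates the Bennett, Bernstein and Hoeffding bounds on $P\{\frac{1}{n}\sum_i X_i > \epsilon\}$ for centered Bernoulli($p$) summands with small $p$. The values of `n`, `p` and `eps` are assumptions chosen so that $\sigma^2 < \epsilon$; they are not taken from the text.

```python
import math


def h(u):
    """The function h(u) = (1 + u) log(1 + u) - u from Bennett's inequality."""
    return (1 + u) * math.log(1 + u) - u


def bennett(n, var, eps):
    # Bennett's bound with t = n * eps
    return math.exp(-n * var * h(eps / var))


def bernstein(n, var, eps):
    return math.exp(-n * eps**2 / (2 * (var + eps / 3)))


def hoeffding(n, eps, width=1.0):
    # every summand lies in an interval of length `width`
    return math.exp(-2 * n * eps**2 / width**2)


n, p, eps = 1000, 0.01, 0.02   # illustrative values
var = p * (1 - p)              # variance of a centered Bernoulli(p) variable
b_bennett = bennett(n, var, eps)
b_bernstein = bernstein(n, var, eps)
b_hoeffding = hoeffding(n, eps)
print(b_bennett, b_bernstein, b_hoeffding)
```

In this regime the two variance-sensitive bounds are smaller than Hoeffding’s bound by several orders of magnitude, which is the behavior described above.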



2 The Efron-Stein Inequality 

The main purpose of these notes is to show how many of the tail inequalities for 
sums of independent random variables can be extended to general functions of 
independent random variables. The simplest, yet surprisingly powerful inequality 
of this kind is known as the Efron-Stein inequality. It bounds the variance of 
a general function. To obtain tail inequalities, one may simply use Chebyshev’s 
inequality. 

Let $\mathcal{X}$ be some set, and let $g : \mathcal{X}^n \to \mathbb{R}$ be a measurable function of $n$
variables. We derive inequalities for the difference between the random variable
$Z = g(X_1, \ldots, X_n)$ and its expected value $EZ$ when $X_1, \ldots, X_n$ are arbitrary
independent (not necessarily identically distributed!) random variables taking
values in $\mathcal{X}$.

The main inequalities of this section follow from the next simple result. To
simplify notation, we write $E_i$ for the expected value with respect to the variable
$X_i$, that is, $E_i Z = E[Z \mid X_1, \ldots, X_{i-1}, X_{i+1}, \ldots, X_n]$.

Theorem 4.

$$\mathrm{Var}(Z) \le \sum_{i=1}^n E\left[(Z - E_i Z)^2\right] .$$



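Proof. The proof is based on elementary properties of conditional expectation.
Recall that if $X$ and $Y$ are arbitrary bounded random variables, then
$E[XY] = E[E[XY \mid Y]] = E[Y E[X \mid Y]]$.

Introduce the notation $V = Z - EZ$, and define

$$V_i = E[Z \mid X_1, \ldots, X_i] - E[Z \mid X_1, \ldots, X_{i-1}], \quad i = 1, \ldots, n.$$

Clearly, $V = \sum_{i=1}^n V_i$. (Thus, $V$ is written as a sum of martingale differences.)
Then

$$\mathrm{Var}(Z) = E\left[\left(\sum_{i=1}^n V_i\right)^2\right] = E\sum_{i=1}^n V_i^2 + 2E\sum_{i>j} V_i V_j = E\sum_{i=1}^n V_i^2 ,$$

since, for any $i > j$,

$$E V_i V_j = E\,E[V_i V_j \mid X_1, \ldots, X_j] = E\left[V_j E[V_i \mid X_1, \ldots, X_j]\right] = 0 .$$

To bound $EV_i^2$, note that, by Jensen’s inequality,

$$\begin{aligned}
V_i^2 &= \left(E[Z \mid X_1, \ldots, X_i] - E[Z \mid X_1, \ldots, X_{i-1}]\right)^2 \\
&= \left(E\left[\, E[Z \mid X_1, \ldots, X_n] - E[Z \mid X_1, \ldots, X_{i-1}, X_{i+1}, \ldots, X_n] \,\middle|\, X_1, \ldots, X_i\right]\right)^2 \\
&\le E\left[\left(E[Z \mid X_1, \ldots, X_n] - E[Z \mid X_1, \ldots, X_{i-1}, X_{i+1}, \ldots, X_n]\right)^2 \,\middle|\, X_1, \ldots, X_i\right] \\
&= E\left[(Z - E_i Z)^2 \,\middle|\, X_1, \ldots, X_i\right] .
\end{aligned}$$

Taking expected values on both sides, we obtain the statement. □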



Now the Efron-Stein inequality follows easily. To state the theorem, let
$X_1', \ldots, X_n'$ form an independent copy of $X_1, \ldots, X_n$ and write

$$Z_i' = g(X_1, \ldots, X_i', \ldots, X_n) .$$

Theorem 5. Efron-Stein inequality (Efron and Stein [36], Steele [37]).

$$\mathrm{Var}(Z) \le \frac{1}{2}\sum_{i=1}^n E\left[(Z - Z_i')^2\right] .$$

Proof. The statement follows by Theorem 4 simply by using (conditionally)
the elementary fact that if $X$ and $Y$ are independent and identically distributed
random variables, then $\mathrm{Var}(X) = (1/2)E[(X - Y)^2]$, and therefore

$$E_i\left[(Z - E_i Z)^2\right] = \frac{1}{2} E_i\left[(Z - Z_i')^2\right] .$$

□



Remark. Observe that in the case when $Z = \sum_{i=1}^n X_i$ is a sum of independent
random variables (of finite variance), the inequality in Theorem 5 becomes
an equality. Thus, the bound in the Efron-Stein inequality is, in a sense, not
improvable. This example also shows that, among all functions of independent
random variables, sums, in some sense, are the least concentrated. Below we will
see further evidence for this extremal property of sums.
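The Efron-Stein bound and the equality case for sums can be checked exactly on a small discrete example. The sketch below enumerates all outcomes of independent Bernoulli variables and computes both $\mathrm{Var}(Z)$ and $\frac{1}{2}\sum_i E[(Z - Z_i')^2]$. The function name `efron_stein_terms`, the probabilities in `probs`, and the choice of `sum` and `max` as test functions are assumptions made only for illustration.

```python
import itertools


def efron_stein_terms(g, probs):
    """Exact Var(g(X)) and Efron-Stein bound for independent Bernoulli(probs[i]) inputs."""
    n = len(probs)

    def weight(x):
        w = 1.0
        for xi, p in zip(x, probs):
            w *= p if xi == 1 else 1 - p
        return w

    points = list(itertools.product((0, 1), repeat=n))
    mean = sum(weight(x) * g(x) for x in points)
    var = sum(weight(x) * (g(x) - mean) ** 2 for x in points)
    bound = 0.0
    for x in points:
        for i in range(n):
            # average over the independent copy X_i' replacing coordinate i
            for xi_new, p_new in ((0, 1 - probs[i]), (1, probs[i])):
                y = x[:i] + (xi_new,) + x[i + 1:]
                bound += 0.5 * weight(x) * p_new * (g(x) - g(y)) ** 2
    return var, bound


probs = [0.2, 0.5, 0.7, 0.4]   # illustrative success probabilities
var_sum, bound_sum = efron_stein_terms(sum, probs)
var_max, bound_max = efron_stein_terms(max, probs)
print("sum:", var_sum, bound_sum)
print("max:", var_max, bound_max)
```

For the sum the two numbers agree up to rounding, while for the maximum the bound is strict, in line with the remark above.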

Another useful corollary of Theorem 4 is obtained by recalling that, for any
random variable $X$, $\mathrm{Var}(X) \le E[(X - a)^2]$ for any constant $a \in \mathbb{R}$. Using this
fact conditionally, we have, for every $i = 1, \ldots, n$,

$$E_i\left[(Z - E_i Z)^2\right] \le E_i\left[(Z - Z_i)^2\right] ,$$

where $Z_i = g_i(X_1, \ldots, X_{i-1}, X_{i+1}, \ldots, X_n)$ for arbitrary measurable functions
$g_i : \mathcal{X}^{n-1} \to \mathbb{R}$ of $n - 1$ variables. Taking expected values and using Theorem 4
we have the following.

Theorem 6.

$$\mathrm{Var}(Z) \le \sum_{i=1}^n E\left[(Z - Z_i)^2\right] .$$

In the next two sections we specialize the Efron-Stein inequality and its
variant Theorem 6 to functions which satisfy some simple easy-to-verify properties.

2.1 Functions with Bounded Differences 

We say that a function $g : \mathcal{X}^n \to \mathbb{R}$ has the bounded differences property if for
some nonnegative constants $c_1, \ldots, c_n$,

$$\sup_{x_1, \ldots, x_n,\, x_i' \in \mathcal{X}} \left|g(x_1, \ldots, x_n) - g(x_1, \ldots, x_{i-1}, x_i', x_{i+1}, \ldots, x_n)\right| \le c_i , \quad 1 \le i \le n .$$

In other words, if we change the $i$-th variable of $g$ while keeping all the others
fixed, the value of the function cannot change by more than $c_i$. Then the
Efron-Stein inequality implies the following:

Corollary 1. If $g$ has the bounded differences property with constants $c_1, \ldots, c_n$,
then

$$\mathrm{Var}(Z) \le \frac{1}{2}\sum_{i=1}^n c_i^2 .$$

Next we list some interesting applications of this corollary. In all cases the 
bound for the variance is obtained effortlessly, while a direct estimation of the 
variance may be quite involved. 
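As a small sanity check of Corollary 1, the sketch below computes the bounded-difference constants $c_i$ by brute force for one concrete function, the maximum of the partial sums of a $\pm 1$ walk, and compares the exact variance under uniform signs with $\frac{1}{2}\sum_i c_i^2$. The function `max_prefix_sum`, the walk length `n`, and the helper name are assumptions for this illustration, not examples from the text.

```python
import itertools


def bounded_difference_constants(g, alphabet, n):
    """Brute-force c_i = sup |g(x) - g(x with coordinate i replaced)| over alphabet^n."""
    c = [0.0] * n
    for x in itertools.product(alphabet, repeat=n):
        gx = g(x)
        for i in range(n):
            for v in alphabet:
                y = x[:i] + (v,) + x[i + 1:]
                c[i] = max(c[i], abs(gx - g(y)))
    return c


def max_prefix_sum(x):
    """Maximum of 0 and all partial sums x_1 + ... + x_k."""
    best, running = 0, 0
    for step in x:
        running += step
        best = max(best, running)
    return best


n = 6                     # illustrative walk length
alphabet = (-1, 1)
c = bounded_difference_constants(max_prefix_sum, alphabet, n)
values = [max_prefix_sum(x) for x in itertools.product(alphabet, repeat=n)]
mean = sum(values) / len(values)          # steps are i.i.d. uniform on {-1, +1}
variance = sum((v - mean) ** 2 for v in values) / len(values)
bound = 0.5 * sum(ci**2 for ci in c)
print("c =", c, " Var(Z) =", variance, " bound =", bound)
```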

Example. Uniform deviations. One of the central quantities of statistical
learning theory and empirical process theory is the following: let $X_1, \ldots, X_n$ be
i.i.d. random variables taking their values in some set $\mathcal{X}$, and let $\mathcal{A}$ be a collection
of subsets of $\mathcal{X}$. Let $\mu$ denote the distribution of $X_1$, that is, $\mu(A) = P\{X_1 \in A\}$,
and let $\mu_n$ denote the empirical distribution:

$$\mu_n(A) = \frac{1}{n}\sum_{i=1}^n 1_{\{X_i \in A\}} .$$

The quantity of interest is

$$Z = \sup_{A \in \mathcal{A}} |\mu_n(A) - \mu(A)| .$$

If $\lim_{n \to \infty} EZ = 0$ for every distribution of the $X_i$’s, then $\mathcal{A}$ is called a
uniform Glivenko-Cantelli class, and Vapnik and Chervonenkis [38] gave a beautiful
combinatorial characterization of such classes. But regardless of what $\mathcal{A}$ is, by
changing one $X_i$, $Z$ can change by at most $1/n$, so regardless of the behavior of
$EZ$, we always have

$$\mathrm{Var}(Z) \le \frac{1}{2n} .$$

For more information on the behavior of Z and its role in learning theory 
see, for example, Devroye, Gyorfi, and Lugosi [39], Vapnik [40], van der Vaart 
and Wellner [41], Dudley [42]. 
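The variance bound $1/(2n)$ can be compared with a Monte Carlo estimate in a simple special case. The sketch below takes $\mathcal{A}$ to be the class of half-lines $(-\infty, a]$ and Uniform(0,1) data, so that $Z$ is the one-sample Kolmogorov-Smirnov statistic. The choice of class, distribution, seed, `n` and `trials` are all assumptions made for this illustration.

```python
import random
import statistics


def sup_deviation(sample):
    """sup over half-lines (-inf, a] of |mu_n(A) - mu(A)| for Uniform(0,1) data."""
    xs = sorted(sample)
    n = len(xs)
    # the supremum is attained just before or at one of the order statistics
    return max(max((k + 1) / n - x, x - k / n) for k, x in enumerate(xs))


random.seed(0)            # fixed seed so the run is reproducible
n, trials = 50, 2000      # illustrative sample size and number of repetitions
zs = [sup_deviation([random.random() for _ in range(n)]) for _ in range(trials)]
var_z = statistics.pvariance(zs)
print("estimated Var(Z) =", var_z, " bound 1/(2n) =", 1 / (2 * n))
```

In this example the estimated variance is well below $1/(2n)$, which is consistent with the remark that follows: the bounded differences bound can be far from tight.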

Next we show how a closer look at the Efron-Stein inequality implies a
significantly better bound for the variance of $Z$. We do this in a slightly more
general framework of empirical processes. Let $\mathcal{F}$ be a class of real-valued functions
and define $Z = g(X_1, \ldots, X_n) = \sup_{f \in \mathcal{F}} \sum_{j=1}^n f(X_j)$. Assume that the functions
$f \in \mathcal{F}$ are such that $E[f(X_i)] = 0$ and take values in $[-1, 1]$. Let $Z_i$ be defined as

$$Z_i = \sup_{f \in \mathcal{F}} \sum_{j \ne i} f(X_j) .$$

Let $\hat{f}$ be the function achieving the supremum* in the definition of $Z$, that
is, $Z = \sum_{i=1}^n \hat{f}(X_i)$, and similarly let $\hat{f}_i$ be such that $Z_i = \sum_{j \ne i} \hat{f}_i(X_j)$. We have

$$\hat{f}_i(X_i) \le Z - Z_i \le \hat{f}(X_i) ,$$

and thus $\sum_{i=1}^n (Z - Z_i) \le Z$. As $\hat{f}_i$ and $X_i$ are independent, $E_i[\hat{f}_i(X_i)] = 0$. On
the other hand, since $0 \le Z - Z_i - \hat{f}_i(X_i)$ and $Z - Z_i + \hat{f}_i(X_i) \le 2$,

$$(Z - Z_i)^2 - \hat{f}_i^2(X_i) = \left(Z - Z_i + \hat{f}_i(X_i)\right)\left(Z - Z_i - \hat{f}_i(X_i)\right) \le 2\left(Z - Z_i - \hat{f}_i(X_i)\right) .$$

Summing over all $i$ and taking expectations,

$$E\left[\sum_{i=1}^n (Z - Z_i)^2\right] \le E\left[\sum_{i=1}^n \left(\hat{f}_i^2(X_i) + 2(Z - Z_i) - 2\hat{f}_i(X_i)\right)\right] \le n \sup_{f \in \mathcal{F}} E[f^2(X_1)] + 2E[Z] ,$$

where at the last step we used the facts that $E[\hat{f}_i(X_i)^2] \le \sup_{f \in \mathcal{F}} E[f^2(X_1)]$,
$\sum_{i=1}^n (Z - Z_i) \le Z$, and $E\hat{f}_i(X_i) = 0$. Thus, by the Efron-Stein inequality,

$$\mathrm{Var}(Z) \le n \sup_{f \in \mathcal{F}} E[f^2(X_1)] + 2E[Z] .$$

* If the supremum is not attained the proof can be modified to yield the same result.
We omit the details here.

From just the bounded differences property we derived $\mathrm{Var}(Z) \le 2n$. The new
bound may be a significant improvement whenever the maximum of $E f(X_i)^2$
over $f \in \mathcal{F}$ is small. (Note that if the class $\mathcal{F}$ is not too large, $EZ$ is typically of
the order of $\sqrt{n}$.) The exponential tail inequality due to Talagrand [12] extends
this variance inequality, and is one of the most important recent results of the
theory of empirical processes, see also Ledoux [20], Massart [23], Rio [10], Klein
[24], and Bousquet [27, 28].

Example. Minimum of the empirical loss. Concentration inequalities have
been used as a key tool in recent developments of model selection methods in
statistical learning theory. For the background we refer to the recent work of
Koltchinskii and Panchenko [43], Massart [44], Bartlett, Boucheron, and Lugosi
[45], Lugosi and Wegkamp [46], Bousquet [47].

Let $\mathcal{F}$ denote a class of $\{0, 1\}$-valued functions on some space $\mathcal{X}$. For
simplicity of the exposition we assume that $\mathcal{F}$ is finite. The results remain true for
general classes as long as the measurability issues are taken care of. Given an
i.i.d. sample $D_n = ((X_i, Y_i))_{i \le n}$ of $n$ pairs of random variables $(X_i, Y_i)$ taking
values in $\mathcal{X} \times \{0, 1\}$, for each $f \in \mathcal{F}$ we define the empirical loss

$$L_n(f) = \frac{1}{n}\sum_{i=1}^n \ell(f(X_i), Y_i) ,$$

where the loss function $\ell$ is defined on $\{0, 1\}^2$ by

$$\ell(y, y') = 1_{\{y \ne y'\}} .$$

In nonparametric classification and learning theory it is common to select an
element of $\mathcal{F}$ by minimizing the empirical loss. The quantity of interest in this
section is the minimal empirical loss

$$\hat{L} = \inf_{f \in \mathcal{F}} L_n(f) .$$

Corollary 1 immediately implies that $\mathrm{Var}(\hat{L}) \le 1/(2n)$. However, a more
careful application of the Efron-Stein inequality reveals that $\hat{L}$ may be much
more concentrated than predicted by this simple inequality. Getting tight results
for the fluctuations of $\hat{L}$ provides better insight into the calibration of penalties
in certain model selection methods.

Let $Z = n\hat{L}$ and let $Z_i'$ be defined as in Theorem 5, that is,

$$Z_i' = \min_{f \in \mathcal{F}} \left[\sum_{j \ne i} \ell(f(X_j), Y_j) + \ell(f(X_i'), Y_i')\right] ,$$

where $(X_i', Y_i')$ is independent of $D_n$ and has the same distribution as $(X_i, Y_i)$.
Now the convenient form of the Efron-Stein inequality is the following:

$$\mathrm{Var}(Z) \le \frac{1}{2}\sum_{i=1}^n E\left[(Z - Z_i')^2\right] = \sum_{i=1}^n E\left[(Z - Z_i')^2 1_{\{Z_i' > Z\}}\right] .$$

Let $f^*$ denote a (possibly non-unique) minimizer of the empirical risk so that
$Z = \sum_{j=1}^n \ell(f^*(X_j), Y_j)$. The key observation is that

$$(Z - Z_i')^2 1_{\{Z_i' > Z\}} \le \left(\ell(f^*(X_i'), Y_i') - \ell(f^*(X_i), Y_i)\right)^2 1_{\{Z_i' > Z\}} = \ell(f^*(X_i'), Y_i')\, 1_{\{\ell(f^*(X_i), Y_i) = 0\}} .$$

Thus,

$$\sum_{i=1}^n E\left[(Z - Z_i')^2 1_{\{Z_i' > Z\}}\right] \le E \sum_{i :\, \ell(f^*(X_i), Y_i) = 0} E_{X_i', Y_i'}\left[\ell(f^*(X_i'), Y_i')\right] \le n\, E L(f^*) ,$$

where $E_{X_i', Y_i'}$ denotes expectation with respect to the variables $X_i', Y_i'$ and for
each $f \in \mathcal{F}$, $L(f) = E\ell(f(X), Y)$ is the true (expected) loss of $f$. Therefore, the
Efron-Stein inequality implies that

$$\mathrm{Var}(\hat{L}) \le \frac{E L(f^*)}{n} .$$

This is a significant improvement over the bound $1/(2n)$ whenever $EL(f^*)$ is
much smaller than $1/2$. This is very often the case. For example, we have

$$L(f^*) = \hat{L} - \left(L_n(f^*) - L(f^*)\right) \le \frac{Z}{n} + \sup_{f \in \mathcal{F}} \left(L(f) - L_n(f)\right) ,$$

so that we obtain

$$\mathrm{Var}(\hat{L}) \le \frac{E\hat{L}}{n} + \frac{E \sup_{f \in \mathcal{F}} (L(f) - L_n(f))}{n} .$$

In most cases of interest, $E \sup_{f \in \mathcal{F}} (L(f) - L_n(f))$ may be bounded by a
constant (depending on $\mathcal{F}$) times $n^{-1/2}$ (see, e.g., Lugosi [48]) and then the
second term on the right-hand side is of the order of $n^{-3/2}$. For exponential
concentration inequalities for $\hat{L}$ we refer to Boucheron, Lugosi, and Massart
[26].

Example. Kernel density estimation. Let $X_1, \ldots, X_n$ be i.i.d. samples
drawn according to some (unknown) density $f$ on the real line. The density is
estimated by the kernel estimate

$$f_n(x) = \frac{1}{nh}\sum_{i=1}^n K\left(\frac{x - X_i}{h}\right) ,$$

where $h > 0$ is a smoothing parameter, and $K$ is a nonnegative function with
$\int K = 1$. The performance of the estimate is measured by the $L_1$ error

$$Z = g(X_1, \ldots, X_n) = \int |f(x) - f_n(x)|\,dx .$$

It is easy to see that

$$\left|g(x_1, \ldots, x_n) - g(x_1, \ldots, x_i', \ldots, x_n)\right| \le \frac{1}{nh}\int \left|K\left(\frac{x - x_i}{h}\right) - K\left(\frac{x - x_i'}{h}\right)\right| dx \le \frac{2}{n} ,$$

so without further work we get

$$\mathrm{Var}(Z) \le \frac{2}{n} .$$

It is known that for every $f$, $\sqrt{n}\,EZ \to \infty$ (see Devroye and Györfi [49]), which
implies, by Chebyshev’s inequality, that for every $\epsilon > 0$

$$P\left\{\left|\frac{Z}{EZ} - 1\right| \ge \epsilon\right\} = P\{|Z - EZ| \ge \epsilon EZ\} \le \frac{\mathrm{Var}(Z)}{\epsilon^2 (EZ)^2} \to 0$$

as $n \to \infty$. That is, $Z/EZ \to 1$ in probability, or in other words, $Z$ is relatively
stable. This means that the random $L_1$-error behaves like its expected value.
This result is due to Devroye [50], [51]. For more on the behavior of the $L_1$ error
of the kernel density estimate we refer to Devroye and Györfi [49], Devroye and
Lugosi [52].



2.2 Self-Bounding Functions 

Another simple property which is satisfied for many important examples is the
so-called self-bounding property. We say that a nonnegative function $g : \mathcal{X}^n \to \mathbb{R}$
has the self-bounding property if there exist functions $g_i : \mathcal{X}^{n-1} \to \mathbb{R}$ such that
for all $x_1, \ldots, x_n \in \mathcal{X}$ and all $i = 1, \ldots, n$,

$$0 \le g(x_1, \ldots, x_n) - g_i(x_1, \ldots, x_{i-1}, x_{i+1}, \ldots, x_n) \le 1$$

and also

$$\sum_{i=1}^n \left(g(x_1, \ldots, x_n) - g_i(x_1, \ldots, x_{i-1}, x_{i+1}, \ldots, x_n)\right) \le g(x_1, \ldots, x_n) .$$

Concentration properties for such functions have been studied by Boucheron,
Lugosi, and Massart [25], Rio [10], and Bousquet [27, 28]. For self-bounding
functions we clearly have

$$\sum_{i=1}^n \left(g(x_1, \ldots, x_n) - g_i(x_1, \ldots, x_{i-1}, x_{i+1}, \ldots, x_n)\right)^2 \le g(x_1, \ldots, x_n) ,$$

and therefore Theorem 6 implies

Corollary 2. If $g$ has the self-bounding property, then

$$\mathrm{Var}(Z) \le EZ .$$

Next we mention some applications of this simple corollary. It turns out that
in many cases the obtained bound is a significant improvement over what we
would obtain by simply using Corollary 1.

Remark. Relative stability. Bounding the variance of $Z$ by its expected
value implies, in many cases, the relative stability of $Z$. A sequence of
nonnegative random variables $(Z_n)$ is said to be relatively stable if $Z_n/EZ_n \to 1$
in probability. This property guarantees that the random fluctuations of $Z_n$
around its expectation are of negligible size when compared to the expectation,
and therefore most information about the size of $Z_n$ is given by $EZ_n$. If $Z_n$ has
the self-bounding property, then, by Chebyshev’s inequality, for all $\epsilon > 0$,

$$P\left\{\left|\frac{Z_n}{EZ_n} - 1\right| > \epsilon\right\} \le \frac{\mathrm{Var}(Z_n)}{\epsilon^2 (EZ_n)^2} \le \frac{1}{\epsilon^2 EZ_n} .$$

Thus, for relative stability, it suffices to have $EZ_n \to \infty$.

Example. Rademacher averages. A less trivial example for self-bounding
functions is the one of Rademacher averages. Let $\mathcal{F}$ be a class of functions
with values in $[-1, 1]$. If $\sigma_1, \ldots, \sigma_n$ denote independent symmetric $\{-1, 1\}$-valued
random variables, independent of the $X_i$’s (the so-called Rademacher random
variables), then we define the conditional Rademacher average as

$$Z = E\left[\sup_{f \in \mathcal{F}} \sum_{i=1}^n \sigma_i f(X_i) \,\middle|\, X_1^n\right] ,$$

where the notation $X_1^n$ is a shorthand for $X_1, \ldots, X_n$. Thus, the expected value
is taken with respect to the Rademacher variables and $Z$ is a function of the $X_i$’s.
Quantities like $Z$ have been known to measure effectively the complexity of model
classes in statistical learning theory, see, for example, Koltchinskii [53], Bartlett,
Boucheron, and Lugosi [45], Bartlett and Mendelson [54], Bartlett, Bousquet,
and Mendelson [55]. It is immediate that $Z$ has the bounded differences property
and Corollary 1 implies $\mathrm{Var}(Z) \le n/2$. However, this bound may be improved
by observing that $Z$ also has the self-bounding property, and therefore
$\mathrm{Var}(Z) \le EZ$. Indeed, defining

$$Z_i = E\left[\sup_{f \in \mathcal{F}} \sum_{j=1,\, j \ne i}^n \sigma_j f(X_j) \,\middle|\, X_1^n\right] ,$$

it is easy to see that $0 \le Z - Z_i \le 1$ and $\sum_{i=1}^n (Z - Z_i) \le Z$ (the details are
left as an exercise). The improvement provided by Corollary 2 is essential since it
is well-known in empirical process theory and statistical learning theory that in
many cases when $\mathcal{F}$ is a relatively small class of functions, $EZ$ may be bounded
by something like $Cn^{1/2}$, where the constant $C$ depends on the class $\mathcal{F}$, see, e.g.,
Vapnik [40], van der Vaart and Wellner [41], Dudley [42].
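The two self-bounding conditions left as an exercise can at least be checked numerically on a toy instance. The sketch below computes the conditional Rademacher average of a small finite class exactly, by enumerating all sign vectors, and then evaluates $Z - Z_i$ for each $i$. The table `values` (rows are functions, columns are the fixed sample points) is made up for illustration, and the helper name `rademacher_average` is hypothetical.

```python
import itertools


def rademacher_average(values, skip=None):
    """E_sigma sup_f sum_{j != skip} sigma_j f(X_j), where values[f][j] = f(X_j)."""
    n = len(values[0])
    idx = [j for j in range(n) if j != skip]
    total = 0.0
    for signs in itertools.product((-1, 1), repeat=len(idx)):
        total += max(sum(s * fv[j] for s, j in zip(signs, idx)) for fv in values)
    return total / 2 ** len(idx)


# Three hypothetical functions with values in [-1, 1], evaluated at five fixed points.
values = [
    [0.5, -1.0, 0.3, 0.8, -0.2],
    [-0.7, 0.4, 1.0, -0.1, 0.6],
    [0.0, 0.9, -0.5, 0.2, -1.0],
]
Z = rademacher_average(values)
gaps = [Z - rademacher_average(values, skip=i) for i in range(len(values[0]))]
print("Z =", Z)
print("Z - Z_i =", gaps, " sum =", sum(gaps))
```

A numerical check on one instance is of course not a proof; it only illustrates what the two conditions say.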
Configuration Functions. An important class of functions satisfying the
self-bounding property consists of the so-called configuration functions defined by
Talagrand [11, section 7]. Our definition, taken from [25], is a slight modification
of Talagrand’s.

Assume that we have a property $P$ defined over the union of finite products
of a set $\mathcal{X}$, that is, a sequence of sets $P_1 \subset \mathcal{X}, P_2 \subset \mathcal{X} \times \mathcal{X}, \ldots, P_n \subset \mathcal{X}^n$. We say
that $(x_1, \ldots, x_m) \in \mathcal{X}^m$ satisfies the property $P$ if $(x_1, \ldots, x_m) \in P_m$. We assume
that $P$ is hereditary in the sense that if $(x_1, \ldots, x_m)$ satisfies $P$ then so does any
subsequence $(x_{i_1}, \ldots, x_{i_k})$ of $(x_1, \ldots, x_m)$. The function $g_n$ that maps any tuple
$(x_1, \ldots, x_n)$ to the size of the largest subsequence satisfying $P$ is the configuration
function associated with property $P$.

Corollary 2 implies the following result:

Corollary 3. Let $g_n$ be a configuration function, and let $Z = g_n(X_1, \ldots, X_n)$,
where $X_1, \ldots, X_n$ are independent random variables. Then

$$\mathrm{Var}(Z) \le EZ .$$

Proof. By Corollary 2 it suffices to show that any configuration function is
self-bounding. Let $Z_i = g_{n-1}(X_1, \ldots, X_{i-1}, X_{i+1}, \ldots, X_n)$. The condition
$0 \le Z - Z_i \le 1$ is trivially satisfied. On the other hand, assume that $Z = k$ and
let $\{X_{i_1}, \ldots, X_{i_k}\} \subset \{X_1, \ldots, X_n\}$ be a subsequence of cardinality $k$ such that
$g_k(X_{i_1}, \ldots, X_{i_k}) = k$. (Note that by the definition of a configuration function
such a subsequence exists.) Clearly, if the index $i$ is such that $i \notin \{i_1, \ldots, i_k\}$
then $Z = Z_i$, and therefore

$$\sum_{i=1}^n (Z - Z_i) \le Z$$

is also satisfied, which concludes the proof. □
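A very simple configuration function, used here only as an assumed illustration, is the number of distinct values in a sample: the property "all entries are pairwise distinct" is hereditary, and the largest subsequence satisfying it has as many elements as there are distinct values. The sketch below checks the two self-bounding conditions over every tuple of a small alphabet and then verifies $\mathrm{Var}(Z) \le EZ$ exactly for i.i.d. uniform draws. The alphabet size `m` and sample size `n` are arbitrary.

```python
import itertools


def distinct_count(x):
    """Length of the longest subsequence of x with pairwise distinct entries."""
    return len(set(x))


m, n = 4, 5  # alphabet size and sample size, chosen only for illustration
points = list(itertools.product(range(m), repeat=n))

self_bounding = True
for x in points:
    z = distinct_count(x)
    # Z - Z_i, where Z_i is the same function applied with the i-th entry removed
    drops = [z - distinct_count(x[:i] + x[i + 1:]) for i in range(n)]
    if not (all(0 <= d <= 1 for d in drops) and sum(drops) <= z):
        self_bounding = False

values = [distinct_count(x) for x in points]  # X_i i.i.d. uniform on {0, ..., m-1}
mean = sum(values) / len(values)
variance = sum((v - mean) ** 2 for v in values) / len(values)
print(self_bounding, "Var(Z) =", variance, " EZ =", mean)
```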

To illustrate the fact that configuration functions appear rather naturally in 
various applications, we describe a prototypical example: 

Example. VC dimension. One of the central quantities in statistical learning
theory is the Vapnik-Chervonenkis dimension, see Vapnik and Chervonenkis [38,
56], Blumer, Ehrenfeucht, Haussler, and Warmuth [57], Devroye, Györfi, and
Lugosi [39], Anthony and Bartlett [58], Vapnik [40], etc.

Let $\mathcal{A}$ be an arbitrary collection of subsets of $\mathcal{X}$, and let $x_1^n = (x_1, \ldots, x_n)$
be a vector of $n$ points of $\mathcal{X}$. Define the trace of $\mathcal{A}$ on $x_1^n$ by

$$\mathrm{tr}(x_1^n) = \{A \cap \{x_1, \ldots, x_n\} : A \in \mathcal{A}\} .$$

The shatter coefficient (or Vapnik-Chervonenkis growth function) of $\mathcal{A}$ in
$x_1^n$ is $T(x_1^n) = |\mathrm{tr}(x_1^n)|$, the size of the trace. $T(x_1^n)$ is the number of different
subsets of the $n$-point set $\{x_1, \ldots, x_n\}$ generated by intersecting it with
elements of $\mathcal{A}$. A subset $\{x_{i_1}, \ldots, x_{i_k}\}$ of $\{x_1, \ldots, x_n\}$ is said to be shattered if
$2^k = T(x_{i_1}, \ldots, x_{i_k})$. The VC dimension $D(x_1^n)$ of $\mathcal{A}$ (with respect to $x_1^n$) is the
cardinality $k$ of the largest shattered subset of $x_1^n$. From the definition it is
obvious that $g_n(x_1^n) = D(x_1^n)$ is a configuration function (associated to the property
of “shatteredness”), and therefore if $X_1, \ldots, X_n$ are independent random
variables, then

$$\mathrm{Var}(D(X_1^n)) \le E D(X_1^n) .$$
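The definitions of trace, shattering and the sample-dependent VC dimension translate directly into a brute-force computation. The sketch below does this for the class of closed intervals on a small integer grid, and checks the two self-bounding conditions for every sample of four points. The class of intervals, the grid of endpoints, and the helper names `trace_size` and `vc_dimension` are assumptions made for this illustration.

```python
import itertools


def trace_size(points, sets):
    """Number of distinct subsets of `points` picked out by the membership tests in `sets`."""
    return len({frozenset(p for p in points if member(p)) for member in sets})


def vc_dimension(sample, sets):
    """Size of the largest shattered subset of the distinct points in `sample`."""
    pts = sorted(set(sample))
    best = 0
    for k in range(1, len(pts) + 1):
        if any(trace_size(sub, sets) == 2**k for sub in itertools.combinations(pts, k)):
            best = k
    return best


# Closed intervals [a, b] with half-integer endpoints; a >= b picks out no integer point.
grid = [k - 0.5 for k in range(5)]
intervals = [lambda p, a=a, b=b: a <= p <= b for a in grid for b in grid]

self_bounding = True
for x in itertools.product(range(4), repeat=4):
    d = vc_dimension(x, intervals)
    drops = [d - vc_dimension(x[:i] + x[i + 1:], intervals) for i in range(len(x))]
    if not (all(0 <= drop <= 1 for drop in drops) and sum(drops) <= d):
        self_bounding = False

print(self_bounding, vc_dimension((0, 1, 2, 3), intervals))
```

Intervals can shatter any two distinct points but never three, so the computed dimension is at most 2 in this example.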



3 The Entropy Method 

In the previous section we saw that the Efron-Stein inequality serves as a
powerful tool for bounding the variance of general functions of independent random
variables. Then, via Chebyshev’s inequality, one may easily bound the tail
probabilities of such functions. However, just as in the case of sums of independent
random variables, tail bounds based on inequalities for the variance are often
not satisfactory, and essential improvements are possible. The purpose of this
section is to present a methodology which allows one to obtain exponential tail
inequalities in many cases. The pursuit of such inequalities has been an
important topic in probability theory in the last few decades. Originally, martingale
methods dominated the research (see, e.g., McDiarmid [2, 3], Rhee and
Talagrand [59], Shamir and Spencer [60]) but independently information-theoretic
methods were also used with success (see Ahlswede, Gács, and Körner [4],
Marton [5, 6, 7], Dembo [8], Massart [9], Rio [10], and Samson [61]). Talagrand’s
induction method [11, 12, 13] caused an important breakthrough both in the
theory and applications of exponential concentration inequalities. In this section
we focus on the so-called “entropy method”, based on logarithmic Sobolev
inequalities developed by Ledoux [20, 21], see also Bobkov and Ledoux [22], Massart
[23], Rio [10], Boucheron, Lugosi, and Massart [25], [26], and Bousquet [27, 28].
This method makes it possible to derive exponential analogues of the Efron-Stein
inequality in perhaps the simplest way.

The method is based on an appropriate modification of the “tensorization” 
inequality Theorem 4. In order to prove this modification, we need to recall some 
of the basic notions of information theory. To keep the material at an elementary 
level, we prove the modified tensorization inequality for discrete random variables 
only. The extension to arbitrary distributions is straightforward. 

3.1 Basic Information Theory 

In this section we summarize some basic properties of the entropy of a discrete- 
valued random variable. For a good introductory book on information theory we 
refer to Cover and Thomas [62]. 

Let $X$ be a random variable taking values in the countable set $\mathcal{X}$ with
distribution $P\{X = x\} = p(x)$, $x \in \mathcal{X}$. The entropy of $X$ is defined by

$$H(X) = E[-\log p(X)] = -\sum_{x \in \mathcal{X}} p(x) \log p(x)$$

(where log denotes natural logarithm and $0 \log 0 = 0$). If $X, Y$ is a pair of discrete
random variables taking values in $\mathcal{X} \times \mathcal{Y}$ then the joint entropy $H(X, Y)$ of $X$
and $Y$ is defined as the entropy of the pair $(X, Y)$. The conditional entropy
$H(X|Y)$ is defined as

$$H(X|Y) = H(X, Y) - H(Y) .$$

Observe that if we write $p(x, y) = P\{X = x, Y = y\}$ and $p(x|y) = P\{X = x \mid Y = y\}$, then

$$H(X|Y) = -\sum_{x \in \mathcal{X},\, y \in \mathcal{Y}} p(x, y) \log p(x|y) ,$$

from which we see that $H(X|Y) \ge 0$. It is also easy to see that the defining
identity of the conditional entropy remains true conditionally, that is, for any
three (discrete) random variables $X, Y, Z$,

$$H(X, Y|Z) = H(Y|Z) + H(X|Y, Z) .$$

(Just add $H(Z)$ to both sides and use the definition of the conditional
entropy.) A repeated application of this yields the chain rule for entropy: for
arbitrary discrete random variables $X_1, \ldots, X_n$,

$$H(X_1, \ldots, X_n) = H(X_1) + H(X_2|X_1) + H(X_3|X_1, X_2) + \cdots + H(X_n|X_1, \ldots, X_{n-1}) .$$



Let $P$ and $Q$ be two probability distributions over a countable set $\mathcal{X}$ with
probability mass functions $p$ and $q$. Then the Kullback-Leibler divergence or
relative entropy of $P$ and $Q$ is

$$D(P\|Q) = \sum_{x \in \mathcal{X}} p(x) \log\frac{p(x)}{q(x)} .$$

Since $\log x \le x - 1$,

$$D(P\|Q) = -\sum_{x \in \mathcal{X}} p(x) \log\frac{q(x)}{p(x)} \ge -\sum_{x \in \mathcal{X}} p(x)\left(\frac{q(x)}{p(x)} - 1\right) = 0 ,$$

so that the relative entropy is always nonnegative, and equals zero if and only if
$P = Q$. This simple fact has some interesting consequences. For example, if $\mathcal{X}$ is
a finite set with $N$ elements and $X$ is a random variable with distribution $P$ and
we take $Q$ to be the uniform distribution over $\mathcal{X}$ then $D(P\|Q) = \log N - H(X)$
and therefore the entropy of $X$ never exceeds the logarithm of the cardinality of
its range.
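The identity $D(P\|Q) = \log N - H(X)$ for uniform $Q$ is easy to confirm numerically. The following sketch implements entropy and relative entropy for finite distributions given as lists of probabilities; the particular distribution `p` is an arbitrary example.

```python
import math


def entropy(p):
    """Entropy in nats, with the convention 0 log 0 = 0."""
    return -sum(px * math.log(px) for px in p if px > 0)


def kl_divergence(p, q):
    """Relative entropy D(P||Q) in nats; assumes q(x) > 0 wherever p(x) > 0."""
    return sum(px * math.log(px / qx) for px, qx in zip(p, q) if px > 0)


p = [0.5, 0.25, 0.125, 0.125]    # illustrative distribution on N = 4 points
uniform = [1 / len(p)] * len(p)
d = kl_divergence(p, uniform)
print("D(P||Q) =", d, " log N - H(X) =", math.log(len(p)) - entropy(p))
```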

Consider a pair of random variables $X, Y$ with joint distribution $P_{X,Y}$ and
marginal distributions $P_X$ and $P_Y$. Noting that $D(P_{X,Y}\|P_X \times P_Y) = H(X) - H(X|Y)$,
the nonnegativity of the relative entropy implies that $H(X) \ge H(X|Y)$,
that is, conditioning reduces entropy. It is similarly easy to see that this fact
remains true for conditional entropies as well, that is,

$$H(X|Y) \ge H(X|Y, Z) .$$

Now we may prove the following inequality of Han [63].

Theorem 7. Han’s inequality. Let $X_1, \ldots, X_n$ be discrete random variables. Then

$$H(X_1, \ldots, X_n) \le \frac{1}{n-1}\sum_{i=1}^n H(X_1, \ldots, X_{i-1}, X_{i+1}, \ldots, X_n) .$$

Proof. For any $i = 1, \ldots, n$, by the definition of the conditional entropy and
the fact that conditioning reduces entropy,

$$\begin{aligned}
H(X_1, \ldots, X_n) &= H(X_1, \ldots, X_{i-1}, X_{i+1}, \ldots, X_n) + H(X_i \mid X_1, \ldots, X_{i-1}, X_{i+1}, \ldots, X_n) \\
&\le H(X_1, \ldots, X_{i-1}, X_{i+1}, \ldots, X_n) + H(X_i \mid X_1, \ldots, X_{i-1}) , \quad i = 1, \ldots, n .
\end{aligned}$$

Summing these $n$ inequalities and using the chain rule for entropy, we get

$$nH(X_1, \ldots, X_n) \le \sum_{i=1}^n H(X_1, \ldots, X_{i-1}, X_{i+1}, \ldots, X_n) + H(X_1, \ldots, X_n) ,$$

which is what we wanted to prove. □
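Han’s inequality can be checked directly on a small joint distribution. In the sketch below the joint distribution of three binary variables is drawn at random (with a fixed seed), each leave-one-out marginal is obtained by summing out one coordinate, and both sides of the inequality are printed. The seed, the dimension `n` and the helper names are assumptions made for this example.

```python
import itertools
import math
import random


def entropy(dist):
    """Entropy in nats of a distribution given as {outcome: probability}."""
    return -sum(p * math.log(p) for p in dist.values() if p > 0)


def drop_coordinate(dist, i):
    """Marginal distribution of all coordinates except the i-th."""
    out = {}
    for x, p in dist.items():
        key = x[:i] + x[i + 1:]
        out[key] = out.get(key, 0.0) + p
    return out


random.seed(1)   # fixed seed so the run is reproducible
n = 3
points = list(itertools.product((0, 1), repeat=n))
weights = [random.random() for _ in points]
total = sum(weights)
joint = {x: w / total for x, w in zip(points, weights)}  # arbitrary dependent joint law

lhs = entropy(joint)
rhs = sum(entropy(drop_coordinate(joint, i)) for i in range(n)) / (n - 1)
print("H(X_1,...,X_n) =", lhs, " Han upper bound =", rhs)
```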

We finish this section by an inequality which may be regarded as a version 
of Han’s inequality for relative entropies. As it was pointed out by Massart [44], 
this inequality may be used to prove the key tensorization inequality of the next 
section. 

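To this end, let $\mathcal{X}$ be a countable set, and let $P$ and $Q$ be probability
distributions on $\mathcal{X}^n$ such that $P = P_1 \times \cdots \times P_n$ is a product measure. We denote the
elements of $\mathcal{X}^n$ by $x_1^n = (x_1, \ldots, x_n)$ and write $x^{(i)} = (x_1, \ldots, x_{i-1}, x_{i+1}, \ldots, x_n)$
for the $(n-1)$-vector obtained by leaving out the $i$-th component of $x_1^n$. Denote
by $Q^{(i)}$ and $P^{(i)}$ the marginal distributions of $x^{(i)}$ according to $Q$ and $P$, that is,

$$Q^{(i)}(x^{(i)}) = \sum_{x \in \mathcal{X}} Q(x_1, \ldots, x_{i-1}, x, x_{i+1}, \ldots, x_n)$$

and

$$P^{(i)}(x^{(i)}) = \sum_{x \in \mathcal{X}} P(x_1, \ldots, x_{i-1}, x, x_{i+1}, \ldots, x_n) = \sum_{x \in \mathcal{X}} P_1(x_1) \cdots P_{i-1}(x_{i-1}) P_i(x) P_{i+1}(x_{i+1}) \cdots P_n(x_n) .$$

Then we have the following.

Theorem 8. Han’s inequality for relative entropies.

$$D(Q\|P) \ge \frac{1}{n-1}\sum_{i=1}^n D(Q^{(i)}\|P^{(i)})$$

or equivalently,

$$D(Q\|P) \le \sum_{i=1}^n \left(D(Q\|P) - D(Q^{(i)}\|P^{(i)})\right) .$$

Proof. The statement is a straightforward consequence of Han’s inequality.
Indeed, Han’s inequality states that

$$\sum_{x_1^n \in \mathcal{X}^n} Q(x_1^n) \log Q(x_1^n) \ge \frac{1}{n-1}\sum_{i=1}^n \sum_{x^{(i)} \in \mathcal{X}^{n-1}} Q^{(i)}(x^{(i)}) \log Q^{(i)}(x^{(i)}) .$$

Since

$$D(Q\|P) = \sum_{x_1^n \in \mathcal{X}^n} Q(x_1^n) \log Q(x_1^n) - \sum_{x_1^n \in \mathcal{X}^n} Q(x_1^n) \log P(x_1^n)$$

and

$$D(Q^{(i)}\|P^{(i)}) = \sum_{x^{(i)} \in \mathcal{X}^{n-1}} \left(Q^{(i)}(x^{(i)}) \log Q^{(i)}(x^{(i)}) - Q^{(i)}(x^{(i)}) \log P^{(i)}(x^{(i)})\right) ,$$

it suffices to show that

$$\sum_{x_1^n \in \mathcal{X}^n} Q(x_1^n) \log P(x_1^n) = \frac{1}{n-1}\sum_{i=1}^n \sum_{x^{(i)} \in \mathcal{X}^{n-1}} Q^{(i)}(x^{(i)}) \log P^{(i)}(x^{(i)}) .$$

This may be seen easily by noting that by the product property of $P$, we have
$P(x_1^n) = P^{(i)}(x^{(i)}) P_i(x_i)$ for all $i$, and also $P(x_1^n) = \prod_{i=1}^n P_i(x_i)$, and therefore

$$\begin{aligned}
\sum_{x_1^n \in \mathcal{X}^n} Q(x_1^n) \log P(x_1^n) &= \frac{1}{n}\sum_{i=1}^n \sum_{x_1^n \in \mathcal{X}^n} Q(x_1^n)\left(\log P^{(i)}(x^{(i)}) + \log P_i(x_i)\right) \\
&= \frac{1}{n}\sum_{i=1}^n \sum_{x_1^n \in \mathcal{X}^n} Q(x_1^n) \log P^{(i)}(x^{(i)}) + \frac{1}{n}\sum_{x_1^n \in \mathcal{X}^n} Q(x_1^n) \log P(x_1^n) .
\end{aligned}$$

Rearranging, we obtain

$$\begin{aligned}
\sum_{x_1^n \in \mathcal{X}^n} Q(x_1^n) \log P(x_1^n) &= \frac{1}{n-1}\sum_{i=1}^n \sum_{x_1^n \in \mathcal{X}^n} Q(x_1^n) \log P^{(i)}(x^{(i)}) \\
&= \frac{1}{n-1}\sum_{i=1}^n \sum_{x^{(i)} \in \mathcal{X}^{n-1}} Q^{(i)}(x^{(i)}) \log P^{(i)}(x^{(i)}) ,
\end{aligned}$$

where we used the defining property of $Q^{(i)}$. □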

3.2 Tensorization of the Entropy 

We are now prepared to prove the main exponential concentration inequalities 
of these notes. Just as in Section 2, we let $X_1, \ldots, X_n$ be independent random
variables, and investigate concentration properties of $Z = g(X_1, \ldots, X_n)$. The
basis of Ledoux’s entropy method is a powerful extension of Theorem 4. Note
that Theorem 4 may be rewritten as

$$\mathrm{Var}(Z) \le \sum_{i=1}^n E\left[E_i(Z^2) - (E_i(Z))^2\right]$$

or, putting $\phi(x) = x^2$,

$$E\phi(Z) - \phi(EZ) \le \sum_{i=1}^n E\left[E_i\phi(Z) - \phi(E_i(Z))\right] .$$

As it turns out, this inequality remains true for a large class of convex
functions $\phi$, see Beckner [64], Latała and Oleszkiewicz [65], Ledoux [20], Boucheron,
Bousquet, Lugosi, and Massart [29], and Chafaï [66]. The case of interest here
is when $\phi(x) = x \log x$. In this case, as seen in the proof below, the left-hand
side of the inequality may be written as the relative entropy between the
distribution induced by $Z$ on $\mathcal{X}^n$ and the distribution of $X_1^n$. Hence the name
“tensorization inequality of the entropy” (see, e.g., Ledoux [20]).

Theorem 9. Let $\phi(x) = x \log x$ for $x > 0$. Let $X_1, \ldots, X_n$ be independent random
variables taking values in $\mathcal{X}$ and let $f$ be a positive-valued function on $\mathcal{X}^n$.
Letting $Y = f(X_1, \ldots, X_n)$, we have

$$E\phi(Y) - \phi(EY) \le \sum_{i=1}^n E\left[E_i\phi(Y) - \phi(E_i(Y))\right] .$$

i=l 

Proof. We only prove the statement for discrete random variables $X_1, \ldots, X_n$.
The extension to the general case is technical but straightforward. The theorem
is a direct consequence of Han’s inequality for relative entropies. First note that
if the inequality is true for a random variable $Y$ then it is also true for $cY$ where
$c$ is a positive constant. Hence we may assume that $EY = 1$. Now define the
probability measure $Q$ on $\mathcal{X}^n$ by

$$Q(x_1^n) = f(x_1^n) P(x_1^n) ,$$

where $P$ denotes the distribution of $X_1^n = X_1, \ldots, X_n$. Then clearly,

$$E\phi(Y) - \phi(EY) = E[Y \log Y] = D(Q\|P) ,$$

which, by Theorem 8, does not exceed $\sum_{i=1}^n \left(D(Q\|P) - D(Q^{(i)}\|P^{(i)})\right)$.
However, straightforward calculation shows that

$$\sum_{i=1}^n \left(D(Q\|P) - D(Q^{(i)}\|P^{(i)})\right) = \sum_{i=1}^n E\left[E_i\phi(Y) - \phi(E_i(Y))\right]$$

and the statement follows. □
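The tensorization inequality of Theorem 9 can be verified exactly on a small example with independent Bernoulli variables. The sketch below computes both sides by enumeration; the positive function `f`, the probabilities `probs`, and the helper name `tensorization_sides` are assumptions chosen for illustration.

```python
import itertools
import math


def phi(y):
    return y * math.log(y)


def tensorization_sides(f, probs):
    """Both sides of Theorem 9 for Y = f(X), with X_i independent Bernoulli(probs[i])."""
    n = len(probs)
    pts = list(itertools.product((0, 1), repeat=n))

    def w(x):
        return math.prod(p if xi else 1 - p for xi, p in zip(x, probs))

    ey = sum(w(x) * f(x) for x in pts)
    lhs = sum(w(x) * phi(f(x)) for x in pts) - phi(ey)
    rhs = 0.0
    for i in range(n):
        p = probs[i]
        for x0 in pts:
            if x0[i] == 1:
                continue  # visit each configuration of the other coordinates once
            x1 = x0[:i] + (1,) + x0[i + 1:]
            w_rest = w(x0) / (1 - p)                  # probability of the other coordinates
            cond_mean = (1 - p) * f(x0) + p * f(x1)   # E_i Y
            cond_phi = (1 - p) * phi(f(x0)) + p * phi(f(x1))  # E_i phi(Y)
            rhs += w_rest * (cond_phi - phi(cond_mean))
    return lhs, rhs


probs = [0.3, 0.6, 0.5]                              # illustrative values in (0, 1)
lhs, rhs = tensorization_sides(lambda x: 1 + x[0] + 2 * x[1] * x[2], probs)
print("E phi(Y) - phi(EY) =", lhs, " tensorized bound =", rhs)
```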

The main idea in Ledoux’s entropy method for proving concentration
inequalities is to apply Theorem 9 to the positive random variable $Y = e^{sZ}$. Then,
denoting the moment generating function of $Z$ by $F(s) = E[e^{sZ}]$, the left-hand
side of the inequality in Theorem 9 becomes

$$sE\left[Ze^{sZ}\right] - E\left[e^{sZ}\right] \log E\left[e^{sZ}\right] = sF'(s) - F(s) \log F(s) .$$

Our strategy, then, is to derive upper bounds for the derivative of $F(s)$ and
derive tail bounds via Chernoff’s bounding. To do this in a convenient way, we
need some further bounds for the right-hand side of the inequality in Theorem 9.
This is the purpose of the next section.



3.3 Logarithmic Sobolev Inequalities 

Recall from Section 2 that we denote $Z_i = g_i(X_1, \ldots, X_{i-1}, X_{i+1}, \ldots, X_n)$, where
$g_i$ is some function over $\mathcal{X}^{n-1}$. Below we further develop the right-hand side of
Theorem 9 to obtain important inequalities which serve as the basis in deriving
exponential concentration inequalities. These inequalities are closely related to
the so-called logarithmic Sobolev inequalities of analysis, see Ledoux [20, 67, 68],
Massart [23].

First we need the following technical lemma:

Lemma 2. For any positive random variable $Y$ and for any $u > 0$,

$$E[Y \log Y] - (EY) \log(EY) \le E[Y \log Y - Y \log u - (Y - u)] .$$

Theorem 10. A logarithmic Sobolev inequality. Denote $\psi(x) = e^x - x - 1$. Then

$$sE\left[Ze^{sZ}\right] - E\left[e^{sZ}\right] \log E\left[e^{sZ}\right] \le \sum_{i=1}^n E\left[e^{sZ}\psi(-s(Z - Z_i))\right] .$$

i=l 

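Proof. We bound each term on the right-hand side of Theorem 9. Note that
Lemma 2 implies that if $Y_i$ is a positive function of $X_1, \ldots, X_{i-1}, X_{i+1}, \ldots, X_n$,
then

$$E_i(Y \log Y) - E_i(Y) \log E_i(Y) \le E_i\left[Y(\log Y - \log Y_i) - (Y - Y_i)\right] .$$

Applying the above inequality to the variables $Y = e^{sZ}$ and $Y_i = e^{sZ_i}$, one
gets

$$E_i(Y \log Y) - E_i(Y) \log E_i(Y) \le E_i\left[e^{sZ}\psi(-s(Z - Z_i))\right]$$

and the proof is completed by Theorem 9. □

The following symmetrized version, due to Massart [23], will also be useful.
Recall that $Z_i' = g(X_1, \ldots, X_i', \ldots, X_n)$ where the $X_i'$ are independent copies of
the $X_i$.

Theorem 11. Symmetrized logarithmic Sobolev inequality. If $\psi$ is defined as
in Theorem 10, then

$$sE\left[Ze^{sZ}\right] - E\left[e^{sZ}\right] \log E\left[e^{sZ}\right] \le \sum_{i=1}^n E\left[e^{sZ}\psi(-s(Z - Z_i'))\right] .$$

Moreover, denote $\tau(x) = x(e^x - 1)$. Then for all $s \in \mathbb{R}$,

$$sE\left[Ze^{sZ}\right] - E\left[e^{sZ}\right] \log E\left[e^{sZ}\right] \le \sum_{i=1}^n E\left[e^{sZ}\tau(-s(Z - Z_i'))\, 1_{\{Z > Z_i'\}}\right] ,$$

$$sE\left[Ze^{sZ}\right] - E\left[e^{sZ}\right] \log E\left[e^{sZ}\right] \le \sum_{i=1}^n E\left[e^{sZ}\tau(s(Z_i' - Z))\, 1_{\{Z < Z_i'\}}\right] .$$

Proof. The first inequality is proved exactly as Theorem 10, just by noting
that, just like $Z_i$, $Z_i'$ is also independent of $X_i$. To prove the second and third
inequalities, write

$$e^{sZ}\psi(-s(Z - Z_i')) = e^{sZ}\psi(-s(Z - Z_i'))\, 1_{\{Z > Z_i'\}} + e^{sZ}\psi(s(Z_i' - Z))\, 1_{\{Z < Z_i'\}} .$$

By symmetry, the conditional expectation of the second term may be written
as

$$\begin{aligned}
E_i\left[e^{sZ}\psi(s(Z_i' - Z))\, 1_{\{Z < Z_i'\}}\right] &= E_i\left[e^{sZ_i'}\psi(s(Z - Z_i'))\, 1_{\{Z > Z_i'\}}\right] \\
&= E_i\left[e^{sZ} e^{-s(Z - Z_i')}\psi(s(Z - Z_i'))\, 1_{\{Z > Z_i'\}}\right] .
\end{aligned}$$

Summarizing, we have

$$E_i\left[e^{sZ}\psi(-s(Z - Z_i'))\right] = E_i\left[\left(\psi(-s(Z - Z_i')) + e^{-s(Z - Z_i')}\psi(s(Z - Z_i'))\right) e^{sZ}\, 1_{\{Z > Z_i'\}}\right] .$$

The second inequality of the theorem follows simply by noting that
$\psi(x) + e^x\psi(-x) = x(e^x - 1) = \tau(x)$. The last inequality follows similarly. □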

3.4 First Example: Bounded Differences and More 

The purpose of this section is to illustrate how the logarithmic Sobolev inequal- 
ities shown in the previous section may be used to obtain powerful exponential 
concentration inequalities. The first result is rather easy to obtain, yet it turns 
out to be very useful. Also, its proof is prototypical, in the sense that it shows, 
in a transparent way, the main ideas. 







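Theorem 12. Assume that there exists a positive constant $C$ such that, almost
surely,

$$\sum_{i=1}^n (Z - Z_i')^2 \le C .$$

Then for all $t > 0$,

$$P[|Z - EZ| > t] \le 2e^{-t^2/4C} .$$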

Proof. Observe that for $x>0$, $\tau(-x)\le x^2$, and therefore, for any $s>0$, Theorem 11 implies

$$ \begin{aligned} sE[Ze^{sZ}] - E[e^{sZ}]\log E[e^{sZ}] &\le E\left[e^{sZ}\sum_{i=1}^n s^2(Z-Z_i')^2 1_{Z>Z_i'}\right] \\ &\le s^2 E\left[e^{sZ}\sum_{i=1}^n (Z-Z_i')^2\right] \\ &\le s^2 C\,E[e^{sZ}] , \end{aligned} $$

where at the last step we used the assumption of the theorem. Now denoting the moment generating function of $Z$ by $F(s) = E[e^{sZ}]$, the above inequality may be re-written as

$$ sF'(s) - F(s)\log F(s) \le Cs^2F(s) . $$

After dividing both sides by $s^2F(s)$, we observe that the left-hand side is just the derivative of $H(s) = s^{-1}\log F(s)$, that is, we obtain the inequality

$$ H'(s) \le C . $$

By l'Hospital's rule we note that $\lim_{s\to0}H(s) = F'(0)/F(0) = EZ$, so by integrating the above inequality, we get $H(s)\le EZ + sC$, or in other words,

$$ F(s) \le e^{sEZ + s^2C} . $$

Now by Markov's inequality,

$$ P[Z>EZ+t] \le F(s)e^{-sEZ-st} \le e^{s^2C-st} . $$

Choosing $s = t/2C$, the upper bound becomes $e^{-t^2/4C}$. Replace $Z$ by $-Z$ to obtain the same upper bound for $P[Z<EZ-t]$. □
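
The last optimization step of the proof can be checked numerically. The following is a minimal sketch, not part of the original text: the values of `t` and `C` and the helper names are illustrative assumptions. It compares the exponent $s^2C - st$ at the choice $s = t/2C$ with a brute-force minimum over a grid of $s$.

```python
import math

def chernoff_exponent(s, t, C):
    """Exponent in the bound P[Z > EZ + t] <= exp(s^2 * C - s * t)."""
    return s * s * C - s * t

def grid_minimum(t, C, steps=20000):
    """Brute-force minimum of the exponent over a grid of s in (0, 2t/C]."""
    upper = 2.0 * t / C
    return min(chernoff_exponent(upper * k / steps, t, C)
               for k in range(1, steps + 1))

t, C = 1.5, 2.0                    # illustrative values (assumption)
s_star = t / (2.0 * C)             # the choice of s made in the proof
closed_form = -t * t / (4.0 * C)   # exponent obtained in the proof

print("exponent at s = t/2C:", chernoff_exponent(s_star, t, C))
print("closed form -t^2/4C :", closed_form)
print("grid minimum        :", grid_minimum(t, C))
print("resulting tail bound:", math.exp(closed_form))
```

Since the exponent is a convex quadratic in $s$, no other choice of $s$ can improve on $t/2C$, which is what the grid search illustrates.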

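Remark. It is easy to see that the condition of Theorem 12 may be relaxed in the following way: if

$$ E\left[\sum_{i=1}^n (Z-Z_i')^2 1_{Z>Z_i'} \,\Big|\, X_1^n\right] \le c $$

then for all $t>0$,

$$ P[Z>EZ+t] \le e^{-t^2/4c} , $$

and if

$$ E\left[\sum_{i=1}^n (Z-Z_i')^2 1_{Z<Z_i'} \,\Big|\, X_1^n\right] \le c , $$

then

$$ P[Z<EZ-t] \le e^{-t^2/4c} . $$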



An immediate corollary of Theorem 12 is a subgaussian tail inequality for 
functions of bounded differences. 

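Corollary 4. Bounded differences inequality. Assume the function $g$ satisfies the bounded differences assumption with constants $c_1,\dots,c_n$. Then

$$ P[|Z-EZ|>t] \le 2e^{-t^2/4C} $$

where $C = \sum_{i=1}^n c_i^2$.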

We remark here that the constant appearing in this corollary may be improved. Indeed, using the martingale method, McDiarmid [2] showed that under the conditions of Corollary 4,

$$ P[|Z-EZ|>t] \le 2e^{-2t^2/C} , \qquad C = \sum_{i=1}^n c_i^2 $$

(see the exercises). Thus, we have been able to extend Corollary 1 to an exponential concentration inequality. Note that by combining the variance bound of Corollary 1 with Chebyshev's inequality, we only obtained

$$ P[|Z-EZ|>t] \le \frac{C}{2t^2} $$

and therefore the improvement is essential. Thus the applications of Corollary 1 in all the examples shown in Section 2.1 are now improved in an essential way without further work.
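
To get a feeling for the size of the improvement, the three bounds discussed here can be tabulated side by side. This is a rough sketch under stated assumptions: the setting (an average of $n = 100$ variables in $[0,1]$, so that $c_i = 1/n$ and $C = 1/n$) and the function names are illustrative choices, not taken from the text.

```python
import math

def chebyshev_bound(t, C):
    """Polynomial bound C / (2 t^2) from the variance bound plus Chebyshev."""
    return min(1.0, C / (2.0 * t * t))

def entropy_method_bound(t, C):
    """Bound 2 exp(-t^2 / (4C)) of Corollary 4."""
    return min(1.0, 2.0 * math.exp(-t * t / (4.0 * C)))

def mcdiarmid_bound(t, C):
    """Bound 2 exp(-2 t^2 / C) obtained by the martingale method."""
    return min(1.0, 2.0 * math.exp(-2.0 * t * t / C))

n = 100                      # illustrative sample size (assumption)
C = n * (1.0 / n) ** 2       # c_i = 1/n for an average of n variables in [0, 1]

for t in (0.2, 0.4, 0.6, 0.8):
    print(f"t={t:.1f}  Chebyshev={chebyshev_bound(t, C):.3e}  "
          f"entropy={entropy_method_bound(t, C):.3e}  "
          f"McDiarmid={mcdiarmid_bound(t, C):.3e}")
```

For moderate $t$ the polynomial bound can still be the smaller of the first two, because of the constant 4 in the exponent; the exponential bounds take over as $t$ grows, and the McDiarmid constant is never worse than the one of Corollary 4.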

However, Theorem 12 is much stronger than Corollary 4. To understand why, just observe that the conditions of Theorem 12 do not require that $g$ has bounded differences. All that's required is that

$$ \sup_{x_1,\dots,x_n,\,x_1',\dots,x_n'} \sum_{i=1}^n \left|g(x_1,\dots,x_n) - g(x_1,\dots,x_{i-1},x_i',x_{i+1},\dots,x_n)\right|^2 \le \sum_{i=1}^n c_i^2 , $$

an obviously much milder requirement.



3.5 Exponential Inequalities for Self-Bounding Functions 

In this section we prove exponential concentration inequalities for self-bounding functions discussed in Section 2.2. Recall that a variant of the Efron-Stein inequality (Theorem 2) implies that for self-bounding functions $\mathrm{Var}(Z)\le E(Z)$. Based on the logarithmic Sobolev inequality of Theorem 10 we may now obtain exponential concentration bounds. The theorem appears in Boucheron, Lugosi, and Massart [25] and builds on techniques developed by Massart [23].

Recall the definitions of the following two functions that we have already seen in Bennett's inequality and in the logarithmic Sobolev inequalities above:

$$ h(u) = (1+u)\log(1+u) - u \quad (u\ge -1), $$

$$ \text{and}\quad \psi(v) = \sup_{u\ge -1}\,[uv - h(u)] = e^v - v - 1 . $$
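
The duality between $h$ and $\psi$ can be verified numerically. The sketch below is only an illustration: the grid range and resolution are arbitrary assumptions. It approximates $\sup_{u\ge-1}[uv - h(u)]$ on a grid and compares it with $e^v - v - 1$.

```python
import math

def h(u):
    """h(u) = (1 + u) log(1 + u) - u for u >= -1, with h(-1) = 1 by continuity."""
    if u == -1.0:
        return 1.0
    return (1.0 + u) * math.log1p(u) - u

def psi(v):
    """Closed form psi(v) = e^v - v - 1."""
    return math.exp(v) - v - 1.0

def psi_by_sup(v, lo=-1.0, hi=10.0, steps=100000):
    """Grid approximation of sup over u in [lo, hi] of u*v - h(u)."""
    best = -math.inf
    for k in range(steps + 1):
        u = lo + (hi - lo) * k / steps
        best = max(best, u * v - h(u))
    return best

for v in (-2.0, -0.5, 0.5, 2.0):
    print(f"v={v:+.1f}  sup={psi_by_sup(v):.8f}  e^v-v-1={psi(v):.8f}")
```

The supremum is attained at $u = e^v - 1$, which lies inside the grid for the values of $v$ used here.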



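Theorem 13. Assume that $g$ satisfies the self-bounding property. Then for every $s\in\mathbb{R}$,

$$ \log E\left[e^{s(Z-EZ)}\right] \le EZ\,\psi(s) . $$

Moreover, for every $t>0$,

$$ P[Z\ge EZ+t] \le \exp\left[-EZ\,h\!\left(\frac{t}{EZ}\right)\right] $$

and for every $0<t\le EZ$,

$$ P[Z\le EZ-t] \le \exp\left[-EZ\,h\!\left(-\frac{t}{EZ}\right)\right] . $$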





By recalling that $h(u)\ge u^2/(2+2u/3)$ for $u\ge 0$ (we have already used this in the proof of Bernstein's inequality) and observing that $h(u)\ge u^2/2$ for $u\le 0$, we obtain the following immediate corollaries: for every $t>0$,

$$ P[Z\ge EZ+t] \le \exp\left[-\frac{t^2}{2EZ+2t/3}\right] $$

and for every $0<t\le EZ$,

$$ P[Z\le EZ-t] \le \exp\left[-\frac{t^2}{2EZ}\right] . $$
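
The relation between the bounds of Theorem 13 and their simpler corollaries can be illustrated numerically. This is a small sketch: the value of the mean and the function names are illustrative assumptions, not part of the text.

```python
import math

def h(u):
    """h(u) = (1 + u) log(1 + u) - u for u >= -1, with h(-1) = 1 by continuity."""
    if u == -1.0:
        return 1.0
    return (1.0 + u) * math.log1p(u) - u

def upper_tail(t, mean):
    """Upper-tail bound exp(-EZ h(t/EZ)) of Theorem 13."""
    return math.exp(-mean * h(t / mean))

def upper_tail_bernstein(t, mean):
    """Relaxed upper-tail bound exp(-t^2 / (2 EZ + 2t/3))."""
    return math.exp(-t * t / (2.0 * mean + 2.0 * t / 3.0))

def lower_tail(t, mean):
    """Lower-tail bound exp(-EZ h(-t/EZ)), valid for 0 < t <= EZ."""
    return math.exp(-mean * h(-t / mean))

def lower_tail_gaussian(t, mean):
    """Relaxed lower-tail bound exp(-t^2 / (2 EZ))."""
    return math.exp(-t * t / (2.0 * mean))

mean = 10.0   # illustrative value of EZ (assumption)
for t in (1.0, 3.0, 6.0, 10.0):
    print(f"t={t:4.1f}  upper: {upper_tail(t, mean):.3e} <= {upper_tail_bernstein(t, mean):.3e}"
          f"   lower: {lower_tail(t, mean):.3e} <= {lower_tail_gaussian(t, mean):.3e}")
```

In each row the bound expressed through $h$ is at least as small as its relaxation, as the two elementary inequalities on $h$ predict.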




Proof. We apply Theorem 10. Since the function $\psi$ is convex with $\psi(0)=0$, for any $s$ and any $u\in[0,1]$, $\psi(-su)\le u\psi(-s)$. Thus, since $Z-Z_i\in[0,1]$, we have that for every $s$, $\psi(-s(Z-Z_i))\le (Z-Z_i)\psi(-s)$ and therefore, Theorem 10 and the condition $\sum_{i=1}^n (Z-Z_i)\le Z$ imply that

$$ \begin{aligned} sE[Ze^{sZ}] - E[e^{sZ}]\log E[e^{sZ}] &\le E\left[\psi(-s)e^{sZ}\sum_{i=1}^n (Z-Z_i)\right] \\ &\le \psi(-s)E[Ze^{sZ}] . \end{aligned} $$

Introduce $\tilde Z = Z - E[Z]$ and define, for any $s$, $\tilde F(s) = E[e^{s\tilde Z}]$. Then the inequality above becomes

$$ [s-\psi(-s)]\frac{\tilde F'(s)}{\tilde F(s)} - \log\tilde F(s) \le EZ\,\psi(-s) , $$

which, writing $G(s) = \log\tilde F(s)$, implies

$$ (1-e^{-s})\,G'(s) - G(s) \le EZ\,\psi(-s) . $$

Now observe that the function $G_0 = EZ\,\psi$ is a solution of the ordinary differential equation $(1-e^{-s})\,G'(s) - G(s) = EZ\,\psi(-s)$. We want to show that $G\le G_0$. In fact, if $G_1 = G - G_0$, then

$$ (1-e^{-s})\,G_1'(s) - G_1(s) \le 0 . \qquad (2) $$

Hence, defining $\tilde G(s) = G_1(s)/(e^s-1)$, we have

$$ (1-e^{-s})(e^s-1)\,\tilde G'(s) \le 0 . $$



Hence $\tilde G'$ is non-positive and therefore $\tilde G$ is non-increasing. Now, since $\tilde Z$ is centered, $G_1'(0) = 0$. Using the fact that $s(e^s-1)^{-1}$ tends to 1 as $s$ goes to 0, we conclude that $\tilde G(s)$ tends to 0 as $s$ goes to 0. This shows that $\tilde G$ is non-positive on $(0,\infty)$ and non-negative over $(-\infty,0)$, hence $G_1$ is everywhere non-positive, therefore $G\le G_0$ and we have proved the first inequality of the theorem. The proof of inequalities for the tail probabilities may be completed by Chernoff's bounding:

$$ P[Z-E[Z]\ge t] \le \exp\left[-\sup_{s>0}\,(ts - EZ\,\psi(s))\right] $$

and

$$ P[Z-E[Z]\le -t] \le \exp\left[-\sup_{s<0}\,(-ts - EZ\,\psi(s))\right] . $$

The proof is now completed by using the easy-to-check (and well-known) relations

$$ \sup_{s>0}\,[ts - EZ\,\psi(s)] = EZ\,h(t/EZ) \quad\text{for } t>0 , $$

$$ \sup_{s<0}\,[-ts - EZ\,\psi(s)] = EZ\,h(-t/EZ) \quad\text{for } 0<t\le EZ . $$

□
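
The two "easy-to-check" relations closing the proof can be confirmed with a few lines of code. This is a minimal sketch under assumptions: the maximizers $s = \log(1+t/EZ)$ and $s = \log(1-t/EZ)$ are obtained here by setting the derivative to zero (they are not stated in the text), and the numerical values are arbitrary.

```python
import math

def h(u):
    """h(u) = (1 + u) log(1 + u) - u for u > -1."""
    return (1.0 + u) * math.log1p(u) - u

def psi(s):
    """psi(s) = e^s - s - 1."""
    return math.exp(s) - s - 1.0

def upper_objective(s, t, mean):
    """Objective t*s - EZ*psi(s) of the upper-tail Chernoff bound."""
    return t * s - mean * psi(s)

def lower_objective(s, t, mean):
    """Objective -t*s - EZ*psi(s) of the lower-tail Chernoff bound."""
    return -t * s - mean * psi(s)

mean, t = 4.0, 1.5                     # illustrative values with 0 < t < EZ
s_up = math.log1p(t / mean)            # stationary point for s > 0 (assumption: derived by hand)
s_lo = math.log1p(-t / mean)           # stationary point for s < 0 (assumption: derived by hand)

print(upper_objective(s_up, t, mean), mean * h(t / mean))
print(lower_objective(s_lo, t, mean), mean * h(-t / mean))
```

Both objectives are concave in $s$, so the stationary points are the maximizers and the printed pairs agree up to rounding.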



3.6 VC Entropy 

Theorems 2 and 13 provide concentration inequalities for functions having the self-bounding property. In Section 2.2 several examples of such functions are discussed. The purpose of this section is to show that the so-called VC entropy is a self-bounding function.

The Vapnik-Chervonenkis (or VC) entropy is closely related to the VC dimension discussed in Section 2.2. Let $\mathcal{A}$ be an arbitrary collection of subsets of $\mathcal{X}$, and let $x_1^n = (x_1,\dots,x_n)$ be a vector of $n$ points of $\mathcal{X}$. Recall that the shatter coefficient is defined as the size of the trace of $\mathcal{A}$ on $x_1^n$, that is,

$$ T(x_1^n) = |\mathrm{tr}(x_1^n)| = |\{A\cap\{x_1,\dots,x_n\} : A\in\mathcal{A}\}| . $$

The VC entropy is defined as the logarithm of the shatter coefficient, that is,

$$ h(x_1^n) = \log_2 T(x_1^n) . $$



Lemma 3. The VC entropy has the self-bounding property.

Proof. We need to show that there exists a function $h'$ of $n-1$ variables such that for all $i = 1,\dots,n$, writing $x^{(i)} = (x_1,\dots,x_{i-1},x_{i+1},\dots,x_n)$, $0\le h(x_1^n) - h'(x^{(i)})\le 1$ and

$$ \sum_{i=1}^n \left(h(x_1^n) - h'(x^{(i)})\right) \le h(x_1^n) . $$

We define $h'$ the natural way, that is, as the entropy based on the $n-1$ points in its arguments. Then clearly, for any $i$, $h'(x^{(i)})\le h(x_1^n)$, and the difference cannot be more than one. The nontrivial part of the proof is to show the second property. We do this using Han's inequality (Theorem 7).

Consider the uniform distribution over the set $\mathrm{tr}(x_1^n)$. This defines a random vector $Y = (Y_1,\dots,Y_n)\in\{0,1\}^n$ (identifying each set in the trace with its indicator vector). Then clearly,

$$ h(x_1^n) = \log_2|\mathrm{tr}(x_1^n)| = \frac{1}{\ln 2}H(Y_1,\dots,Y_n) $$

where $H(Y_1,\dots,Y_n)$ is the (joint) entropy of $Y_1,\dots,Y_n$. Since the uniform distribution maximizes the entropy, we also have, for all $i\le n$, that

$$ h'(x^{(i)}) \ge \frac{1}{\ln 2}H(Y_1,\dots,Y_{i-1},Y_{i+1},\dots,Y_n) . $$

Since by Han's inequality

$$ H(Y_1,\dots,Y_n) \le \frac{1}{n-1}\sum_{i=1}^n H(Y_1,\dots,Y_{i-1},Y_{i+1},\dots,Y_n) , $$

we have

$$ \sum_{i=1}^n \left(h(x_1^n) - h'(x^{(i)})\right) \le h(x_1^n) $$

as desired. □
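
The two properties established in the proof can be observed on a concrete class. The sketch below is purely illustrative: the choice of the class of closed intervals on the real line, the sample points, and the helper names are assumptions made for the example. It computes the trace by brute force and checks both self-bounding conditions.

```python
import math

def interval_trace(points):
    """Trace of the class of closed intervals [a, b] on the given points.

    Every set in the trace is a run of consecutive points (in sorted order),
    plus the empty set picked out by an interval containing no point.
    """
    pts = sorted(set(points))
    trace = {frozenset()}
    for i in range(len(pts)):
        for j in range(i, len(pts)):
            trace.add(frozenset(pts[i:j + 1]))
    return trace

def vc_entropy(points):
    """VC entropy: log2 of the shatter coefficient (size of the trace)."""
    return math.log2(len(interval_trace(points)))

def self_bounding_gaps(points):
    """Differences h(x_1^n) - h'(x^(i)) obtained by deleting the i-th point."""
    h_full = vc_entropy(points)
    return [h_full - vc_entropy(points[:i] + points[i + 1:])
            for i in range(len(points))]

points = [0.3, 1.2, 2.5, 4.0, 7.1]    # five distinct illustrative points
h_full = vc_entropy(points)
gaps = self_bounding_gaps(points)

print("shatter coefficient:", len(interval_trace(points)))
print("VC entropy         :", h_full)
print("gaps               :", gaps)
print("sum of gaps        :", sum(gaps))
```

For $n$ distinct points this class has shatter coefficient $n(n+1)/2 + 1$, so each gap lies strictly between 0 and 1 and their sum stays below the entropy itself.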



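The above lemma, together with Theorems 2 and 13, immediately implies the following:

Corollary 5. Let $X_1,\dots,X_n$ be independent random variables taking their values in $\mathcal{X}$ and let $Z = h(X_1^n)$ denote the random VC entropy. Then $\mathrm{Var}(Z)\le E[Z]$, for $t>0$

$$ P[Z\ge EZ+t] \le \exp\left[-\frac{t^2}{2EZ+2t/3}\right] , $$

and for every $0<t\le EZ$,

$$ P[Z\le EZ-t] \le \exp\left[-\frac{t^2}{2EZ}\right] . $$

Moreover, for the random shatter coefficient $T(X_1^n)$, we have

$$ E\log_2 T(X_1^n) \le \log_2 E\,T(X_1^n) \le \log_2 e\; E\log_2 T(X_1^n) . $$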

Note that the left-hand side of the last statement follows from Jensen's inequality, while the right-hand side follows by taking $s = \ln 2$ in the first inequality of Theorem 13. This last statement shows that the expected VC entropy $E\log_2 T(X_1^n)$ and the annealed VC entropy $\log_2 E\,T(X_1^n)$ are tightly connected, regardless of the class of sets $\mathcal{A}$ and the distribution of the $X_i$'s. We note here that this fact answers, in a positive way, an open question raised by Vapnik [69, pages 53-54]: the empirical risk minimization procedure is non-trivially consistent and rapidly convergent if and only if the annealed entropy rate $(1/n)\log_2 E[T(X_1^n)]$ converges to zero. For the definitions and discussion we refer to [69].
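
The sandwich between the expected and the annealed VC entropy can be computed exactly in a toy case. The following sketch rests on illustrative assumptions: the class of half-lines $(-\infty,a]$, points drawn uniformly from a set of $m$ values, and the small sizes $n = 4$, $m = 3$ are choices made only for the example.

```python
import math
from itertools import product

def shatter_coefficient_halflines(sample):
    """Trace size of the class {(-inf, a]} on the sample.

    Each distinct value cuts off one new subset, and the empty set is
    always in the trace, so the size is (number of distinct values) + 1.
    """
    return len(set(sample)) + 1

def entropy_sandwich(n, m):
    """Exact E log2 T, log2 E T and log2(e) * E log2 T for X_i uniform on {0..m-1}."""
    outcomes = list(product(range(m), repeat=n))
    p = 1.0 / len(outcomes)
    expected = sum(p * math.log2(shatter_coefficient_halflines(x)) for x in outcomes)
    annealed = math.log2(sum(p * shatter_coefficient_halflines(x) for x in outcomes))
    return expected, annealed, math.log2(math.e) * expected

expected, annealed, upper = entropy_sandwich(n=4, m=3)
print("E log2 T          :", expected)
print("log2 E T          :", annealed)
print("log2(e) E log2 T  :", upper)
```

The three printed numbers appear in increasing order, in agreement with the last statement of Corollary 5.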






3.7 Variations on the Theme 

In this section we show how the techniques of the entropy method for proving concentration inequalities may be used in various situations not considered so far. The versions differ in the assumptions on how $\sum_{i=1}^n (Z-Z_i')^2$ is controlled by different functions of $Z$. For various other versions with applications we refer to Boucheron, Lugosi, and Massart [26]. In all cases the upper bound is roughly of the form $e^{-t^2/\sigma^2}$ where $\sigma^2$ is the corresponding Efron-Stein upper bound on $\mathrm{Var}(Z)$. The first inequality may be regarded as a generalization of the upper tail inequality in Theorem 13.

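Theorem 14. Assume that there exist positive constants $a$ and $b$ such that

$$ \sum_{i=1}^n (Z-Z_i')^2 1_{Z>Z_i'} \le aZ + b . $$

Then for $s\in(0,1/a)$,

$$ \log E[\exp(s(Z-E[Z]))] \le \frac{s^2}{1-as}(aEZ+b) $$

and for all $t>0$,

$$ P\{Z>EZ+t\} \le \exp\left(\frac{-t^2}{4aEZ+4b+2at}\right) . $$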



Proof. Let $s>0$. Just like in the first steps of the proof of Theorem 12, we use the fact that for $x>0$, $\tau(-x)\le x^2$, and therefore, by Theorem 11 we have

$$ \begin{aligned} sE[Ze^{sZ}] - E[e^{sZ}]\log E[e^{sZ}] &\le E\left[s^2e^{sZ}\sum_{i=1}^n (Z-Z_i')^2 1_{Z>Z_i'}\right] \\ &\le s^2\left(aE[Ze^{sZ}] + bE[e^{sZ}]\right) , \end{aligned} $$

where at the last step we used the assumption of the theorem.

Denoting, once again, $F(s) = E[e^{sZ}]$, the above inequality becomes

$$ sF'(s) - F(s)\log F(s) \le as^2F'(s) + bs^2F(s) . $$

After dividing both sides by $s^2F(s)$, once again we see that the left-hand side is just the derivative of $H(s) = s^{-1}\log F(s)$, so we obtain

$$ H'(s) \le a(\log F(s))' + b . $$

Using the fact that $\lim_{s\to0}H(s) = F'(0)/F(0) = EZ$ and $\log F(0) = 0$, and integrating the inequality, we obtain

$$ H(s) \le EZ + a\log F(s) + bs , $$

or, if $s<1/a$,

$$ \log E[\exp(s(Z-E[Z]))] \le \frac{s^2}{1-as}(aEZ+b) , $$

proving the first inequality. The inequality for the upper tail now follows by Markov's inequality and the following technical lemma whose proof is left as an exercise. □

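Lemma 4. Let $C$ and $a$ denote two positive real numbers and denote $h_1(x) = 1 + x - \sqrt{1+2x}$. Then

$$ \sup_{\lambda\in[0,1/a)}\left(\lambda t - \frac{C\lambda^2}{1-a\lambda}\right) = \frac{2C}{a^2}\,h_1\!\left(\frac{at}{2C}\right) \ge \frac{t^2}{2(2C+at)} $$

and the supremum is attained at

$$ \lambda = \frac{1}{a}\left(1 - \left(1+\frac{at}{C}\right)^{-1/2}\right) . $$

Also, if $t<C/a$,

$$ \sup_{\lambda\in[0,\infty)}\left(\lambda t - \frac{C\lambda^2}{1+a\lambda}\right) = \frac{2C}{a^2}\,h_1\!\left(-\frac{at}{2C}\right) \ge \frac{t^2}{4C} $$

and the supremum is attained at

$$ \lambda = \frac{1}{a}\left(\left(1-\frac{at}{C}\right)^{-1/2} - 1\right) . $$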
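
Since the proof of the lemma is left as an exercise, a numerical check of its first part may be helpful. This is a rough sketch: the values of `t`, `C`, `a` and the grid resolution are arbitrary assumptions, and the code only tests the statement, it does not prove it.

```python
import math

def h1(x):
    """h_1(x) = 1 + x - sqrt(1 + 2x)."""
    return 1.0 + x - math.sqrt(1.0 + 2.0 * x)

def objective(lam, t, C, a):
    """lam * t - C * lam^2 / (1 - a * lam), defined for 0 <= lam < 1/a."""
    return lam * t - C * lam * lam / (1.0 - a * lam)

def sup_closed_form(t, C, a):
    """Closed-form supremum (2C / a^2) * h_1(a t / (2C))."""
    return 2.0 * C / (a * a) * h1(a * t / (2.0 * C))

def argmax_closed_form(t, C, a):
    """Maximizer (1/a) * (1 - (1 + a t / C)^(-1/2))."""
    return (1.0 - (1.0 + a * t / C) ** -0.5) / a

def sup_on_grid(t, C, a, steps=200000):
    """Brute-force maximum over a grid of lam in [0, 1/a)."""
    return max(objective(k / (steps * a), t, C, a) for k in range(steps))

t, C, a = 1.0, 2.0, 0.5     # illustrative values (assumption)
print("closed form :", sup_closed_form(t, C, a))
print("at argmax   :", objective(argmax_closed_form(t, C, a), t, C, a))
print("grid search :", sup_on_grid(t, C, a))
print("lower bound :", t * t / (2.0 * (2.0 * C + a * t)))
```

With $C = aEZ + b$ the lower bound is exactly the exponent $t^2/(4aEZ+4b+2at)$ appearing in Theorem 14.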






There is a subtle difference between upper and lower tail bounds. Bounds for the lower tail $P\{Z<EZ-t\}$ may be easily derived, due to Chebyshev's association inequality which states that if $X$ is a real-valued random variable and $f$ is a nonincreasing and $g$ is a nondecreasing function, then

$$ E[f(X)g(X)] \le E[f(X)]\,E[g(X)] . $$

Theorem 15. Assume that for some nondecreasing function $g$,

$$ \sum_{i=1}^n (Z-Z_i')^2 1_{Z<Z_i'} \le g(Z) . $$

Then for all $t>0$,

$$ P[Z<EZ-t] \le \exp\left(\frac{-t^2}{4E[g(Z)]}\right) . $$

Proof. To prove lower-tail inequalities we obtain upper bounds for $F(s) = E[\exp(sZ)]$ with $s<0$. By the third inequality of Theorem 11,

$$ \begin{aligned} sE[Ze^{sZ}] - E[e^{sZ}]\log E[e^{sZ}] &\le \sum_{i=1}^n E\left[e^{sZ}\tau(s(Z_i'-Z))1_{Z<Z_i'}\right] \\ &\le \sum_{i=1}^n E\left[e^{sZ}s^2(Z_i'-Z)^2 1_{Z<Z_i'}\right] \\ &\qquad\text{(using } s<0 \text{ and that } \tau(-x)\le x^2 \text{ for } x>0\text{)} \\ &= s^2E\left[e^{sZ}\sum_{i=1}^n (Z-Z_i')^2 1_{Z<Z_i'}\right] \\ &\le s^2E\left[e^{sZ}g(Z)\right] . \end{aligned} $$

Since $g(Z)$ is a nondecreasing and $e^{sZ}$ is a decreasing function of $Z$, Chebyshev's association inequality implies that

$$ E\left[e^{sZ}g(Z)\right] \le E\left[e^{sZ}\right]E[g(Z)] . $$

Thus, dividing both sides of the obtained inequality by $s^2F(s)$ and writing $H(s) = (1/s)\log F(s)$, we obtain

$$ H'(s) \le E[g(Z)] . $$

Integrating the inequality in the interval $[s,0)$ we obtain

$$ F(s) \le \exp\left(s^2E[g(Z)] + sE[Z]\right) . $$

Markov's inequality and optimizing in $s$ now implies the theorem. □
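
The step of the proof that uses Chebyshev's association inequality can be illustrated on a finite distribution, where both sides are computed exactly. This is a minimal sketch: the uniform distribution on $\{0,\dots,9\}$, the value $s = -0.3$ and the function $g(z) = z^2$ are illustrative assumptions.

```python
import math

def expectation(values, probs, func):
    """Exact expectation of func(X) for a finitely supported X."""
    return sum(p * func(x) for x, p in zip(values, probs))

def association_gap(values, probs, f, g):
    """E[f(X)] E[g(X)] - E[f(X) g(X)]; nonnegative when f is nonincreasing
    and g is nondecreasing (Chebyshev's association inequality)."""
    joint = expectation(values, probs, lambda x: f(x) * g(x))
    return expectation(values, probs, f) * expectation(values, probs, g) - joint

values = list(range(10))                 # support of Z (assumption)
probs = [0.1] * 10                       # uniform distribution (assumption)
s = -0.3                                 # a negative s, as in the proof

def f(z):
    return math.exp(s * z)               # decreasing in z because s < 0

def g(z):
    return float(z * z)                  # nondecreasing on the support

print("E[f]E[g] - E[fg] =", association_gap(values, probs, f, g))
```

The printed gap is positive, reflecting that a decreasing and a nondecreasing function of the same variable are negatively correlated.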

The next result is useful when one is interested in lower-tail bounds but $\sum_{i=1}^n (Z-Z_i')^2 1_{Z<Z_i'}$ is difficult to handle. In some cases $\sum_{i=1}^n (Z-Z_i')^2 1_{Z>Z_i'}$ is easier to bound. In such a situation we need the additional guarantee that $|Z-Z_i'|$ remains bounded. Without loss of generality, we assume that the bound is 1.

Theorem 16. Assume that there exists a nondecreasing function $g$ such that $\sum_{i=1}^n (Z-Z_i')^2 1_{Z>Z_i'}\le g(Z)$ and for any value of $X_1^n$ and $X_i'$, $|Z-Z_i'|\le 1$. Then for all $K>0$, $s\in[0,1/K)$,

$$ \log E[\exp(-s(Z-E[Z]))] \le s^2\frac{\tau(K)}{K^2}E[g(Z)] , $$

and for all $t>0$, with $t\le (e-1)E[g(Z)]$, we have

$$ P[Z<EZ-t] \le \exp\left(-\frac{t^2}{4(e-1)E[g(Z)]}\right) . $$



Proof. The key observation is that the function $\tau(x)/x^2 = (e^x-1)/x$ is increasing if $x>0$. Choose $K>0$. Thus, for $s\in(-1/K,0)$, the second inequality of Theorem 11 implies that

$$ \begin{aligned} sE[Ze^{sZ}] - E[e^{sZ}]\log E[e^{sZ}] &\le \sum_{i=1}^n E\left[e^{sZ}\tau(-s(Z-Z_i'))1_{Z>Z_i'}\right] \\ &\le s^2\frac{\tau(K)}{K^2}E\left[e^{sZ}\sum_{i=1}^n (Z-Z_i')^2 1_{Z>Z_i'}\right] \\ &\le s^2\frac{\tau(K)}{K^2}E\left[g(Z)e^{sZ}\right] , \end{aligned} $$

where at the last step we used the assumption of the theorem.

Just like in the proof of Theorem 15, we bound $E[g(Z)e^{sZ}]$ by $E[g(Z)]\,E[e^{sZ}]$. The rest of the proof is identical to that of Theorem 15. Here we took $K = 1$. □



Finally we give, without proof, an inequality (due to Bousquet [28]) for 
functions satisfying conditions similar but weaker than the self-bounding con- 
ditions. This is very useful for suprema of empirical processes for which the 
non-negativity assumption does not hold. 

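Theorem 17. Assume $Z$ satisfies $\sum_{i=1}^n (Z-Z_i)\le Z$, and there exist random variables $Y_i$ such that for all $i = 1,\dots,n$, $Y_i\le Z-Z_i\le 1$, $Y_i\le a$ for some $a>0$ and $E_iY_i\ge 0$. Also, let $\sigma^2$ be a real number such that

$$ \sigma^2 \ge \frac{1}{n}\sum_{i=1}^n E_i[Y_i^2] . $$

Then for all $t>0$,

$$ P\{Z\ge EZ+t\} \le \exp\left(-v\,h\!\left(\frac{t}{v}\right)\right) , $$

where $v = (1+a)EZ + n\sigma^2$.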

An important application of the above theorem is the following version of 
Talagrand’s concentration inequality for empirical processes. The constants ap- 
pearing here were obtained by Bousquet [27] . 





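Corollary 6. Let $\mathcal{F}$ be a set of functions that satisfy $Ef(X_i) = 0$ and $\sup_{f\in\mathcal{F}}\sup f\le 1$. We denote

$$ Z = \sup_{f\in\mathcal{F}}\sum_{i=1}^n f(X_i) . $$

Let $\sigma$ be a positive real number such that $n\sigma^2\ge\sum_{i=1}^n\sup_{f\in\mathcal{F}}E[f^2(X_i)]$. Then for all $t\ge 0$, we have

$$ P\{Z\ge EZ+t\} \le \exp\left(-v\,h\!\left(\frac{t}{v}\right)\right) , $$

with $v = n\sigma^2 + 2EZ$.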